1  Introduction


Since  the  publication  of  the  seminal  note,  Kullback  and  Leibler  (1951),  there  has  been 
continual  endeavor  in  statistics  and  related  fields  to  explicate  the  existing  statistical  methods 
and  to  develop  new  methods  based  on  the  logarithmic  information  of  Shannon  (1948).  There 
are  many  fine  collections  of  information-theoretic  methodologies  and  their  applications  to 
the  related  fields  such  as  Kullback  (1954,  1959),  Lindley  (1956),  Jaynes  (1957,  1968,  1982), 
Theil  (1967),  Akaike  (1973),  Gokhale  and  Kullback  (1978),  Shore  and  Johnson  (1980),  Kapur 
(1989), Brockett (1991), Cover and Thomas (1991), Csiszár (1991), Zellner (1991), Maasoumi
(1993),  and  Soofi  (1994). 

During the last four decades, numerous information-theoretic regression methods have
been developed. Kullback and Rosenblatt (1957) pioneered the information-theoretic
approach to regression by explicating the usual regression quantities, such as sums of squares
and F-ratios, in terms of information functions. We now have information-theoretic
methods for model and predictive density derivation, parameter estimation and testing, model
selection, collinearity analysis, and influential observation detection, which can be used in
sampling theory and Bayesian regression analyses.

The  logical  foundation,  elegance,  and  versatility  of  the  information  theoretic  approach 
have  been  increasingly  attracting  the  attention  of  researchers  in  various  fields.  However,  the 
available  entropy-based  methods  are  not  yet  commonly  used  in  the  mainstream  regression 
analysis.  Many  information-theoretic  regression  methods  are  developed  disjointly  in  the 
context  of  providing  alternatives  to  particular  problems  rather  than  as  integral  parts  of  a 
system  of  regression  analysis.  Information-theoretic  interpretation  of  many  of  the  available 
methods  and  the  relationship  among  them  have  not  yet  been  fully  explicated.  The  purpose 
of  this  paper  is  to  integrate  the  existing  entropy-based  methods  in  a  single  framework,  to 
explore  their  interrelationships,  to  elaborate  on  information  theoretic  interpretations  of  the 
existing  entropy-based  diagnostics,  and  to  present  information  theoretic  interpretations  for 
some  traditional  diagnostics. 
2  Information  Functions


In  this  section,  the  basic  information  functions  used  in  regression  analysis,  their  properties, 
interpretations,  and  relationships  with  the  Fisher  information  are  reviewed. 

2.1  Entropy 

The entropy of a continuous random variable $X$ is defined as

$$H(X) \equiv H[f(x)] = -\int_{-\infty}^{\infty} f(x)\log f(x)\,dx,$$

where $f(x)$ is the probability density function for the absolutely continuous distribution $F$.

The differential entropy may be negative or infinite. Boundedness of $f(x)$ implies
$H(X) > -\infty$ (Ash 1965, p. 237). For a distribution with finite variance, the entropy is
bounded above, $H(X) < \infty$, but the converse may not hold.

The conditional entropy is obtained by using the conditional density in the entropy
expression, $H(Y|x) = H[f(y|x)]$. Conditioning on a particular value $x$ may increase or
decrease the entropy. The expected conditional entropy is defined by

$$H(Y|X) = E_x[H(Y|x)] \le H(Y);$$

the equality holds if and only if the two variables are independent. That is, on average,
conditioning decreases the entropy.

The entropy of an $n$-dimensional random variable $X = (X_1,\ldots,X_n)'$ is obtained by using
the joint density $f(x)$ in the entropy expression. For the joint entropy of an $n$-dimensional
random variable, we have

$$H(X_1,\ldots,X_n) = \sum_{i=1}^{n} H(X_i|X_{i-1},\ldots,X_1) \le \sum_{i=1}^{n} H(X_i).$$

In the last relation, the equality holds if and only if the random variables $X_1,\ldots,X_n$ are
independent.

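The differential entropy is not invariant under one-to-one transformations of $X$. For any
continuous random variable $Z = g(X)$,

$$H(Z) = H(X) - E\left[\log\left|\frac{dx}{dz}\right|\right], \tag{2.1}$$

where $dx/dz$ is the derivative of the inverse transformation $x = g^{-1}(z)$.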
A random variable $X$ with a location parameter $\mu$ and scale parameter $\sigma$ may be written
as $X = \sigma Z + \mu$, where the distribution of $Z$ is independent of $\mu$ and $\sigma$. Using (2.1), we find

$$H(X|\mu,\sigma) = H(Z) + \log\sigma.$$

Thus the entropy is location invariant but not scale invariant.
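
The location-scale property can be checked numerically. The following is a minimal sketch that assumes $Z$ is standard normal, so that $X = \sigma Z + \mu$ is $N(\mu, \sigma^2)$; the function name `normal_entropy` is illustrative and does not come from the paper.

```python
import math

def normal_entropy(sigma):
    """Differential entropy (in nats) of N(mu, sigma^2); it does not depend on mu."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

h_z = normal_entropy(1.0)              # H(Z) for the standard normal
for sigma in (0.5, 1.0, 3.0):
    h_x = normal_entropy(sigma)        # H(X) for X = sigma * Z + mu, any mu
    # The two printed numbers should agree: H(X) - H(Z) = log(sigma)
    print(sigma, h_x - h_z, math.log(sigma))
```

For $\sigma < 1$ the difference is negative, which also illustrates that a differential entropy can be negative.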

$H(f)$ is concave in $f$. The Maximum Entropy (ME) model $f^*(x|\theta)$ is the density that
maximizes $H[f(x|\theta)]$ subject to the information moment constraints,

$$E_f[c_m(X)|\theta] = \theta_m, \quad m = 1,\ldots,M, \tag{2.2}$$

where the $c_m$'s are integrable with respect to $f$ and $\theta = (\theta_1,\ldots,\theta_M)$ is the vector of moment
values. The moment values might be known quantities (e.g., computed from the data) or
unknown parameters. The ME solution, if it exists, is of the form

$$f^*(x|\theta) = C(\theta)\exp\{-\eta_1 c_1(x) - \cdots - \eta_M c_M(x)\}, \tag{2.3}$$

where $C(\theta)$ is the normalizing constant for the ME density and $\eta_m = \eta_m(\theta)$, $m = 1,\ldots,M$,
are Lagrange multipliers for enforcing the information constraints (2.2). Multivariate ME
distributions are found similarly.
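
To make the role of the Lagrange multipliers concrete, the sketch below solves a single-constraint ME problem numerically. It is an illustration under stated assumptions, not the paper's procedure: the support $[0, 40]$ is discretized on a midpoint grid, the only constraint is $c(x) = x$ with $E(X) = \theta = 1$, and the multiplier is found by bisection. The helper names (`me_weights`, `mean_under`, `solve_eta`) are hypothetical. On $[0, \infty)$ the continuous ME solution under a mean constraint is the exponential density with $\eta = 1/\theta$, so the computed multiplier should be close to 1.

```python
import math

def me_weights(grid, eta):
    """Discretized exponential-family ME form with c(x) = x: p_i proportional to exp(-eta * x_i)."""
    w = [math.exp(-eta * x) for x in grid]
    total = sum(w)
    return [wi / total for wi in w]

def mean_under(grid, eta):
    """Mean of the discretized ME distribution for a given multiplier."""
    p = me_weights(grid, eta)
    return sum(x * pi for x, pi in zip(grid, p))

def solve_eta(grid, theta, lo=1e-6, hi=50.0, iters=100):
    """Bisection for eta such that the ME mean equals theta.
    The mean is decreasing in eta, so the root lies between lo and hi."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_under(grid, mid) > theta:
            lo = mid          # mean too large: a larger multiplier is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)

h = 0.01
grid = [(i + 0.5) * h for i in range(4000)]    # midpoints covering [0, 40]
eta = solve_eta(grid, theta=1.0)
print(eta, mean_under(grid, eta))
```

With several constraints the same idea applies, but the multipliers must be solved jointly, for example by Newton's method on the moment equations.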

The entropy measures the “uniformity” of a distribution and provides a measure of
information in the following sense. $H(X)$ increases as the concentration of probabilities over
subsets of the support of the distribution decreases. This feature makes $H(X)$ a suitable
measure of uncertainty associated with $f(x)$. The term uncertainty describes the difficulty
of predicting an outcome $x$ of a random variable $X$ with the probability distribution $f(x)$.
A distribution $f_1(x)$ with a large entropy is less concentrated (its outcomes are more difficult
to predict) than a distribution $f_2(x)$ with a smaller entropy. Thus, $f_1(x)$ is less informative
than $f_2(x)$. Some authors have interpreted $-H(X)$ as an information criterion
in the context of developing least informative probability distributions; see, e.g., Zellner (1971).

The ME distribution is the least informative distribution since it does not include any
information that is not explicitly formulated as a constraint in (2.2). The information content
of each moment constraint in (2.2) is reflected in the uncertainty reduction power of that
constraint (Jaynes 1968, 1982; Soofi 1992, 1994). Suppose that the ME distributions exist
for all $M$ constraints in (2.2) and for a subset of $\ell$ constraints, $\ell < M$. Then the amount of
information provided by the additional constraints $E_f[c_{\ell+1}(X)|\theta] = \theta_{\ell+1}, \ldots, E_f[c_M(X)|\theta] = \theta_M$
is quantified by the amount of entropy reduction,

$$H[f^*(x|\theta_1,\ldots,\theta_\ell)] - H[f^*(x|\theta_1,\ldots,\theta_M)] \ge 0, \qquad 0 \le \ell < M.$$

Information indices are constructed by mapping the entropy reduction to the unit interval.

The information index of $M - \ell$ additional constraints on a continuous ME distribution
may be computed by the following exponential transformation of the entropy reduction:

$$I_c(c_{\ell+1},\ldots,c_M) = 1 - \exp\{-(H[f^*(x|\theta_1,\ldots,\theta_\ell)] - H[f^*(x|\theta_1,\ldots,\theta_M)])\}, \qquad 0 \le \ell < M.$$

An $I_c(c_{\ell+1},\ldots,c_M) \approx 0$ indicates that the additional constraints are redundant for
concentrating the probabilities. An $I_c(c_{\ell+1},\ldots,c_M) \approx 1$ indicates that the first set of constraints
is redundant. In particular, for $\ell = 0$, the ME distribution over an infinite support is
improper uniform with infinite entropy; thus, $I_c(c_1,\ldots,c_M) = 1$.

An  information  index  of  distributions  in  a  specific  class  is  defined  in  Section  2.2. 

Example  2.1 

(i) The $n$-variate ME density subject to the information constraints

$$E(X) = \mu, \quad E(X-\mu)(X-\mu)' = \Sigma \tag{2.4}$$

is the $n$-variate normal $N(\mu,\Sigma)$. The entropy of $N(\mu,\Sigma)$ is

$$H(X) = \frac{n}{2}\log(2\pi e) + \frac{1}{2}\log|\Sigma|,$$

where $|\Sigma|$ denotes the determinant of the covariance matrix.

(ii) Consider the following constraints for a bivariate random variable $X = (X_1, X_2)'$:

$$c_i(X) = X_i^2, \quad E(X_i^2) = \sigma^2, \quad i = 1, 2.$$

The ME distribution subject to these constraints is the bivariate normal
$f^*(x;\sigma^2) = N(0, \sigma^2 I_2)$, where $I_n$ is the identity matrix of order $n$. Note that since the constraints
$c_i(X)$, $i = 1, 2$, only include information about the marginal moments, the ME solution
is independence between the components of $X$. That is, when the information about
a relationship between the components of a random variable is not present in the
information constraints, the ME solution reflects the absence of a relationship.

(iii) Now consider the additional cross-product constraint

$$c_3(X) = X_1X_2, \quad E(X_1X_2) = \rho\sigma^2.$$

The ME distribution subject to $c_1$, $c_2$, and $c_3$ is the bivariate normal distribution

$$f^*(x;\sigma^2,\rho) = N\left(0,\ \sigma^2\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right).$$

Hence, the ME solution reflects the information specified in terms of the correlation in
the cross-product constraint.

The entropy reduction due to $c_3$ is

$$H[f^*(x;\sigma^2)] - H[f^*(x;\sigma^2,\rho)] = -\frac{1}{2}\log(1-\rho^2) \ge 0.$$

The partial information index of the additional constraint is

$$I_c(X_1X_2) = 1 - (1-\rho^2)^{1/2}.$$

Thus, for example, for $\rho = .6$ the uncertainty reduction is 20% and for $\rho = .8$ the
uncertainty reduction is 40%.
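
The two percentages quoted in part (iii) can be reproduced directly. The sketch below assumes that the information index is the exponential transformation $1 - \exp(-\text{entropy reduction})$, which is the form that yields the 20% and 40% figures; the function names are illustrative.

```python
import math

def entropy_reduction(rho):
    """Entropy reduction due to the cross-product constraint: -(1/2) log(1 - rho^2)."""
    return -0.5 * math.log(1.0 - rho ** 2)

def information_index(rho):
    """Index in [0, 1): 1 - exp(-entropy reduction) = 1 - sqrt(1 - rho^2)."""
    return 1.0 - math.exp(-entropy_reduction(rho))

for rho in (0.6, 0.8):
    print(rho, round(information_index(rho), 4))   # about 0.2 and 0.4
```

The index is zero at $\rho = 0$ and approaches one as $|\rho| \to 1$, where the cross-product constraint carries almost all of the information.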


2.2  Discrimination  Information 


The most widely known information-theoretic measure of discrepancy between two
distributions is the Kullback-Leibler discrimination information function

$$K(f:g) = \int_{-\infty}^{\infty} f(x)\log\frac{f(x)}{g(x)}\,dx = -H[f(x)] - E_f\{\log[g(X)]\}. \tag{2.5}$$

The discrimination information function between two multivariate distributions is defined
similarly. $K(f:g)$ is well-defined as long as $g(x) = 0$ only if $f(x) = 0$.
$K(f:g)$ is the entropy of $F$ relative to $G$. It is also referred to as the cross-entropy
between the two distributions. In general, there is no relationship between $K(f:g)$ and
the entropy $H(\cdot)$.

If $f(x|\theta)$ is a distribution in the class $\Omega_\theta$ of distributions that satisfy (2.2) and $f^*$
is the ME distribution in $\Omega_\theta$, then (Soofi, Ebrahimi, and Habibullah 1995)

$$K(f:f^*|\theta) = H[f^*(x|\theta)] - H[f(x|\theta)].$$

The Information Discrimination (ID) index of a distribution in $\Omega_\theta$ is defined by

$$ID(f:f^*|\theta) = 1 - \exp[-K(f:f^*|\theta)].$$

A distribution $f \in \Omega_\theta$ is said to be ID distinguishable from the ME model if

$$ID(f:f^*|\theta) > ID(f^*:f^*|\theta) = 0. \tag{2.6}$$
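
As an illustration of the entropy-difference identity, consider the class of distributions with mean zero and variance $\sigma^2$, whose ME member is $N(0,\sigma^2)$, and take $f$ to be the Laplace density with the same variance. The sketch below is an assumption-laden example rather than the authors' computation: it evaluates $K(f:f^*)$ as the entropy difference, checks it against a midpoint-rule integration of definition (2.5), and assumes the ID index has the form $1 - \exp(-K)$. All function names are hypothetical.

```python
import math

def normal_entropy_from_var(var):
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def laplace_entropy_from_var(var):
    b = math.sqrt(var / 2.0)            # Laplace scale b has variance 2 b^2
    return 1.0 + math.log(2.0 * b)

def id_index(var):
    """K(f : f*) as an entropy difference, and the index 1 - exp(-K)."""
    k = normal_entropy_from_var(var) - laplace_entropy_from_var(var)
    return k, 1.0 - math.exp(-k)

def kl_numeric(var, half_width=40.0, steps=200000):
    """Midpoint-rule evaluation of K(Laplace : Normal) from the integral definition."""
    b = math.sqrt(var / 2.0)
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        log_f = -math.log(2.0 * b) - abs(x) / b
        log_g = -0.5 * math.log(2.0 * math.pi * var) - x * x / (2.0 * var)
        total += math.exp(log_f) * (log_f - log_g) * h
    return total

k, idx = id_index(1.0)
print(k, kl_numeric(1.0), idx)
```

The identity holds here because $\log f^*$ is quadratic in $x$, so its expectation under any member of the class depends only on the constrained second moment.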

The properties of $K(f:g)$ for discrete and continuous distributions are the same. Some
properties of $K(f:g)$ are as follows (Kullback 1959):

(i) $K(f:g) \ge 0$; the equality holds if and only if $f(x) = g(x)$ almost everywhere.

(ii) For mutually independent random variables $X_1,\ldots,X_n$,

$$K[f(x_1,\ldots,x_n):g(x_1,\ldots,x_n)] = \sum_{i=1}^{n} K[f(x_i):g(x_i)].$$

(iii) For any two random variables $X$ and $Y$,

$$K[f(x,y):g(x,y)] = K[f(x):g(x)] + E_x\{K[f(y|x):g(y|x)]\}$$

$$= K[f(y):g(y)] + E_y\{K[f(x|y):g(x|y)]\}.$$

Thus, for example, $K[f(x,y):g(x,y)] \ge K[f(x):g(x)]$; the equality holds if and
only if the expected discrimination information between the respective conditional
distributions is zero.

(iv) Let $Y = T(X)$ be a transformation and let $f_Y(y)$ and $g_Y(y)$ denote the distributions
induced by $T$ on $f_X(x)$ and $g_X(x)$. Then $K(f_Y:g_Y) \le K(f_X:g_X)$, with equality if
and only if

$$\frac{f_Y(T(x))}{g_Y(T(x))} = \frac{f_X(x)}{g_X(x)} \tag{2.7}$$

for almost all $x$. If condition (2.7) holds, $T$ is a sufficient statistic for discrimination.

(v) When $f = f(x|\theta)$ and $g = g(x|\theta)$, $K(f_Y:g_Y) \le K(f_X:g_X)$ with equality if and only if $T$ is a
sufficient statistic for $\theta$.

(vi) $K(f:g)$ is convex in $f$ and in $g$.

The Minimum Discrimination Information (MDI) model with reference to a distribution $g$ is
obtained by minimizing $K(f:g)$ with respect to $f$ subject to the information constraints
(2.2). The MDI density, if it exists, is given by

$$f^*(x;g,\theta) = C(\theta)\,g(x)\exp\{-\eta_1 c_1(x) - \cdots - \eta_M c_M(x)\},$$

where $C(\theta)$ is the normalizing constant and the $\eta_m$'s are Lagrange multipliers.

When $K(f:g) = K(f:g|\theta)$, where $\theta$ is the unknown parameter of one of the
distributions, the parameter may be estimated by an MDI procedure; see Section 4.


Example  2.2 

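(i) The discrimination information between two $n$-dimensional normal distributions
$f = N(\mu_f,\Sigma_f)$ and $g = N(\mu_g,\Sigma_g)$ is

$$K(f:g) = \frac{1}{2}\left[\mathrm{Tr}(\Sigma_f\Sigma_g^{-1}) - \log|\Sigma_f\Sigma_g^{-1}| - n\right] + \frac{1}{2}(\mu_f-\mu_g)'\Sigma_g^{-1}(\mu_f-\mu_g), \tag{2.8}$$

where Tr denotes the trace of a matrix. The first term in (2.8) gives the information
discrepancy due to two different covariance structures and the second term gives the
information discrepancy due to two different means. For $\Sigma_f = \sigma_f^2 I_n$ and $\Sigma_g = \sigma_g^2 I_n$,
(2.8) gives

$$K(f:g) = \frac{n}{2}\left[\frac{\sigma_f^2}{\sigma_g^2} - \log\frac{\sigma_f^2}{\sigma_g^2} - 1\right] + \frac{1}{2\sigma_g^2}(\mu_f-\mu_g)'(\mu_f-\mu_g). \tag{2.9}$$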

(ii) Let $g(y;\mu_g,\Sigma_g) = N(\mu_g,\Sigma_g)$ be an $n$-variate normal density. Then the MDI
distribution with reference to $g$ subject to the mean information constraint

$$E_f(Y) = \mu_f \tag{2.10}$$

is the $n$-variate normal $f^* = N(\mu_f,\Sigma_g)$; the proof is given in Soofi (1985). Thus, the
minimum information discrepancy between the class of distributions that satisfy (2.10)
and $N(\mu_g,\Sigma_g)$ is given by the second term in (2.8).
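
The isotropic special case of the normal discrimination information is easy to verify numerically. The sketch below, with illustrative function names and arbitrary test values, computes the $n$-dimensional isotropic formula and compares it with the sum of univariate discrimination informations, which must agree by the additivity property for independent components. It also shows that, when the two variances coincide, only the mean term remains, as in part (ii).

```python
import math

def kl_normal_1d(mu_f, var_f, mu_g, var_g):
    """Discrimination information K(f : g) between two univariate normals."""
    ratio = var_f / var_g
    cov_term = 0.5 * (ratio - math.log(ratio) - 1.0)
    mean_term = (mu_f - mu_g) ** 2 / (2.0 * var_g)
    return cov_term + mean_term

def kl_isotropic(mu_f, var_f, mu_g, var_g):
    """K(f : g) for Sigma_f = var_f * I_n and Sigma_g = var_g * I_n."""
    n = len(mu_f)
    ratio = var_f / var_g
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_f, mu_g))
    return 0.5 * n * (ratio - math.log(ratio) - 1.0) + sq_dist / (2.0 * var_g)

mu_f, mu_g = [1.0, 0.0, -2.0], [0.0, 0.5, 1.0]
total = kl_isotropic(mu_f, 2.0, mu_g, 1.5)
by_parts = sum(kl_normal_1d(a, 2.0, b, 1.5) for a, b in zip(mu_f, mu_g))
mean_only = kl_isotropic(mu_f, 1.5, mu_g, 1.5)   # equal variances: mean term only
print(total, by_parts, mean_only)
```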
2.3  Mutual  Information

The entropy difference $\vartheta(Y|x) = H(Y) - H(Y|x)$ measures the information provided
by the value $x$ about the random variable $Y$. A particular value of $x$ may or may not be
informative, which is indicated by the sign of $\vartheta(Y|x)$. The mutual information between two
random variables is defined by

$$\begin{aligned} \vartheta(Y \wedge X) &= E_x[\vartheta(Y|x)] \\ &= H(Y) - H(Y|X) \\ &= H(X) - H(X|Y) \\ &= H(X) + H(Y) - H(X,Y). \end{aligned} \tag{2.11}$$

In terms of the discrimination information, the mutual information is given by

$$\begin{aligned} \vartheta(Y \wedge X) &= K[f(x,y):f(x)f(y)] \\ &= E_x\{K[f(y|x):f(y)]\} \\ &= E_y\{K[f(x|y):f(x)]\}. \end{aligned} \tag{2.12}$$

Thus $\vartheta(Y \wedge X) = \vartheta(X \wedge Y) \ge 0$ with equality if and only if $f(x,y) = f(x)f(y)$. Accordingly,
$\vartheta(Y \wedge X)$ is a measure of stochastic dependency between the two variables.

A useful normalization of $\vartheta(Y \wedge X)$ for the continuous case is

$$I_c(Y \wedge X) = 1 - \exp\{-2\vartheta(Y \wedge X)\}.$$

$I_c(Y \wedge X)$ is an index of functional relationship between the two variables. It generalizes
the correlation coefficient; see Example 2.3, part (iii). An $I_c(Y \wedge X) = 0$ indicates that the
two variables are independent. An $I_c(Y \wedge X) = 1$ indicates that the two variables are
functionally dependent; see Joe (1989) for details.

$\vartheta(Y \wedge X)$ is invariant under one-to-one transformations of each variable. But $\vartheta(Y \wedge X)$ is
not invariant under rotation of the coordinate system because (2.7) does not generally hold
under rotations; see Example 2.3, part (ii).

In the multivariate case, various mutual information functions may be obtained. The mutual
information between the components of a $p$-dimensional random variable $X = (X_1,\ldots,X_p)$
is found by the multivariate extensions of (2.12) or (2.11) as:

$$\vartheta(X_1 \wedge \cdots \wedge X_p) = K[f(x_1,\ldots,x_p):f(x_1)\cdots f(x_p)] = \sum_{i=1}^{p} H(X_i) - H(X_1,\ldots,X_p).$$

Mutual information functions for measuring other types of multivariate dependencies are
found similarly.

The mutual information between a random variable $Y$ and a $p$-dimensional random vector
$X$ is given by

$$\vartheta(Y \wedge X) = K[f(y,x):f(y)f(x)] = \sum_{j=1}^{p}\vartheta(Y \wedge X_j|X_{j-1},\ldots,X_1). \tag{2.13}$$

The partial mutual information function $\vartheta(Y \wedge X_j|X_{j-1},\ldots,X_1)$ measures the conditional
dependency between the pair $(Y,X_j)$ given $X_1,\ldots,X_{j-1}$. In general, the decomposition
(2.13) depends on the order of the variables $X_1,\ldots,X_p$. The partial mutual information may
be interpreted as a measure of relative importance of $X_j$ in a given order.

Example  2.3 

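(i) If $X = (X_1,\ldots,X_p)'$ has multivariate normal distribution $N(\mu,\Sigma)$, then

$$\vartheta(X_1 \wedge \cdots \wedge X_p) = \frac{1}{2}\sum_{j=1}^{p}\log\sigma_{jj} - \frac{1}{2}\log|\Sigma| = \frac{1}{2}\sum_{j=1}^{p}\log\sigma_{jj} - \frac{1}{2}\sum_{\ell=1}^{p}\log\lambda_\ell,$$

where $\sigma_{jj} = Var(X_j)$ and $\lambda_\ell$ is the $\ell$th eigenvalue of $\Sigma$.

(ii) Let $W = \Gamma X$ be the rotation of the coordinates of $X$ by the matrix $\Gamma$ of the
eigenvectors of $\Sigma$. The components of $W$ are uncorrelated and $Var(W_\ell) = \lambda_\ell$. Thus
$\vartheta(W_1 \wedge \cdots \wedge W_p) = 0 \le \vartheta(X_1 \wedge \cdots \wedge X_p)$, with equality if and only if the $X_j$'s are
uncorrelated.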

(iii) If $(Y, X_1,\ldots,X_p)$ are jointly normal, then $H[Y|(x_1,\ldots,x_p)]$ is a function of the
variances and covariances, and is functionally independent of $(x_1,\ldots,x_p)$. Thus the mutual
information is equal to the entropy difference

$$\vartheta[Y \wedge (X_1,\ldots,X_p)] = H(Y) - H[Y|(x_1,\ldots,x_p)] = -\frac{1}{2}\log[1-\rho^2(Y;X_1\cdots X_p)] = -\frac{1}{2}\sum_{j=1}^{p}\log[1-\rho^2(Y;X_j|x_{j-1},\ldots,x_1)],$$

where $\rho^2(Y;X_1\cdots X_p)$ is the square of the multiple correlation between $Y$ and $X_1,\ldots,X_p$
and $\rho^2(Y;X_j|x_{j-1},\ldots,x_1)$ is the square of the partial correlation between $Y$ and $X_j$.

The partial mutual information $-\log[1-\rho^2(Y;X_j|x_{j-1},\ldots,x_1)]^{1/2}$ gives a measure of
relative importance of $X_j$ in regression analysis (Theil 1987, Theil and Chung 1988).

(iv) The normalized index of dependency is $I_c[Y \wedge (X_1,\ldots,X_p)] = \rho^2(Y;X_1\cdots X_p)$.
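
The statements of Example 2.3 can be illustrated in the bivariate case with the standard library alone. The sketch below uses an arbitrary covariance matrix and hypothetical helper names; it assumes the normalization $I_c = 1 - \exp(-2\vartheta)$, which is the form consistent with part (iv). It computes the mutual information between the two components, shows that it vanishes after rotating to the eigenvector coordinates, and confirms that the normalized index equals the squared correlation.

```python
import math

def mi_components(s11, s12, s22):
    """Mutual information between the two components of a bivariate normal:
    (1/2)(log s11 + log s22) - (1/2) log det(Sigma)."""
    det = s11 * s22 - s12 ** 2
    return 0.5 * (math.log(s11) + math.log(s22)) - 0.5 * math.log(det)

def eigenvalues(s11, s12, s22):
    """Eigenvalues of the symmetric 2 x 2 covariance matrix."""
    mean = 0.5 * (s11 + s22)
    half_gap = math.sqrt((0.5 * (s11 - s22)) ** 2 + s12 ** 2)
    return mean + half_gap, mean - half_gap

s11, s12, s22 = 2.0, 1.2, 1.0                 # illustrative covariance entries
rho2 = s12 ** 2 / (s11 * s22)                 # squared correlation
mi_x = mi_components(s11, s12, s22)           # original coordinates
lam1, lam2 = eigenvalues(s11, s12, s22)
mi_w = mi_components(lam1, 0.0, lam2)         # rotated coordinates: diagonal covariance
index = 1.0 - math.exp(-2.0 * mi_x)           # normalized index of dependency
print(mi_x, mi_w, index, rho2)
```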


2.4  Information  About  A  Parameter 


Quantification of uncertainty about predicting an outcome of a random draw from a
distribution $f(x)$ and comparison of the uncertainties about the outcomes of two probability
distributions $f_1$ and $f_2$ are of prime interest in many econometric problems. Examples
in regression analysis include comparing the uncertainties associated with the prior and
posterior distributions of the coefficient vector, two posterior distributions of the regression
coefficient vector, or the sampling distributions of two estimators of the regression coefficient
vector based on two different regression matrix structures.

Traditionally, the variance is used for measuring the uncertainty. The widespread use of
variance for measuring uncertainty is rooted in statistical estimation (Fisher 1921). In
statistical estimation, Fisher's information is defined as

$$\mathcal{F}(\theta) = \mathcal{F}[f(x|\theta)] = -E_{x|\theta}\left[\frac{\partial^2}{\partial\theta^2}\log f(x|\theta)\right].$$

$\mathcal{F}(\theta)$ is a measure of information in $X$, i.e., in $f(x|\theta)$ about the parameter $\theta$, in the sense
that $\mathcal{F}(\theta)$ quantifies “the ease with which a parameter can be estimated” by $x$ (Lehmann
1983, p. 120). Inherent in this interpretation are the facts that: (a) $X$ is an unbiased and
efficient estimator of $\theta$, so $V(X|\theta) = [\mathcal{F}(\theta)]^{-1}$, and (b) under $f(x|\theta)$, the probabilities are
concentrated around the mean value $\theta$.
From the information-theoretic viewpoint, the Fisher information $\mathcal{F}$ is a second-order
approximation to the discrimination information function $K(f_\theta:f_{\theta+\Delta\theta})$, where $\theta$ and $\theta+\Delta\theta$
are two neighboring points in the parameter space and the two distributions $f_\theta$ and $f_{\theta+\Delta\theta}$
belong to the same parametric family.
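
A small numerical check makes the second-order approximation tangible. The sketch below assumes the standard expansion $K(f_\theta:f_{\theta+\Delta\theta}) \approx \tfrac{1}{2}(\Delta\theta)^2\mathcal{F}(\theta)$ and uses the exponential density $\theta e^{-\theta x}$, $x > 0$, chosen here only as a convenient example with closed forms; the function names are illustrative.

```python
import math

def kl_exponential(theta, delta):
    """K(f_theta : f_{theta+delta}) for the density theta * exp(-theta * x), x > 0."""
    new = theta + delta
    return math.log(theta / new) + new / theta - 1.0

def fisher_exponential(theta):
    """Fisher information of the exponential density with rate theta."""
    return 1.0 / theta ** 2

theta = 2.0
for delta in (0.4, 0.1, 0.01):
    exact = kl_exponential(theta, delta)
    approx = 0.5 * delta ** 2 * fisher_exponential(theta)
    # The relative gap shrinks as delta decreases
    print(delta, exact, approx, abs(exact - approx) / approx)
```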

The interpretation of variance as an uncertainty measure about the prediction of an
outcome of a random draw from a distribution requires caution. Consider two random variables
$X$ and $Y$ with probability distributions $f_X$ and $f_Y$ on the same support. If $f_X$ is flatter
than $f_Y$, which assigns high probabilities to the extreme values of $Y$, then $V(X) < V(Y)$;
for example, $f_X = \mathrm{Beta}(1.5, 1.5)$ and $f_Y = \mathrm{Beta}(.5, .5)$. The outcomes of $Y$
are more volatile, but easier to predict than the outcomes of $X$. Note that $H(X) > H(Y)$.
Ebrahimi and Soofi (1996) showed that for many well-known parametric families of
distributions the variance and entropy order similarly in terms of the distributions' parameters.
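
The Beta example can be verified with closed-form expressions. The sketch below relies on the standard formulas for the variance and differential entropy of a Beta$(a,b)$ distribution; because the Python standard library has no digamma function, a small helper based on the usual recurrence and asymptotic series is included. All names are illustrative.

```python
import math

def digamma(x):
    """Digamma function for x > 0 via upward recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def beta_variance(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1.0))

def beta_entropy(a, b):
    """Differential entropy of Beta(a, b) in nats."""
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (log_beta - (a - 1.0) * digamma(a) - (b - 1.0) * digamma(b)
            + (a + b - 2.0) * digamma(a + b))

var_x, h_x = beta_variance(1.5, 1.5), beta_entropy(1.5, 1.5)
var_y, h_y = beta_variance(0.5, 0.5), beta_entropy(0.5, 0.5)
print(var_x, var_y)   # X has the smaller variance
print(h_x, h_y)       # yet X has the larger entropy
```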

The interpretation of the entropy as a measure of uncertainty about an unknown
parameter requires caution. $H[f(x|\theta)]$ is a measure of uncertainty about an outcome $x$, not
about $\theta$. Sometimes, $x$ is a suitable estimate of $\theta$, e.g., when $\theta$ is a location parameter.
Ebrahimi and Soofi (1990) interpreted the entropy of the maximum likelihood estimator of
a parameter as a measure of information about the parameter being estimated. In such
cases, information about $x$ may be interpreted as information about $\theta$. Such indirect uses of
entropy as a measure of information should be interpreted accordingly.

In Bayesian statistics involving a parameter $\theta$, the information about the parameter is
measured by a discrepancy between the posterior and prior distributions; see, e.g., Goel and
DeGroot (1979) and Goel (1983). The difference between the prior entropy $H[\pi(\theta)]$ and the
posterior entropy $H[\pi(\theta|x)]$ measures the contribution of data $x$ to the amount of uncertainty
about the parameter; see Abel and Singpurwalla (1994) for an interesting example.

The mutual information $\vartheta(\Theta \wedge X)$ provides a measure of expected information in data
$X$ about the parameter (Lindley 1956) and has been used in regression problems; see, e.g.,
Stone (1958) and Soofi (1985, 1990). For $Y = T(X)$, $\vartheta(\Theta \wedge X) \ge \vartheta(\Theta \wedge Y)$, with equality
if and only if $T(X)$ is a sufficient statistic for $\theta$. For a fixed $f(x|\theta)$, $\vartheta(\Theta \wedge X)$ is concave in
$\pi(\theta)$. However, maximization of $\vartheta(\Theta \wedge X)$ with respect to $\pi(\theta)$ is usually intractable. Lindley
(1961) showed that ignorance between two neighboring values $\theta$ and $\theta + \Delta\theta$ in the parameter
space implies that $\vartheta(\Theta \wedge X) \approx 2(\Delta\theta)^2\mathcal{F}(\theta)$, $\mathcal{F}$ being the Fisher information. According to
this relationship, Jeffreys prior for $\theta$ is an approximate solution to the density that may be
obtained by maximization of $\vartheta(\Theta \wedge X)$. Bernardo (1979a, 1979b) developed a limiting solution
to the maximization of $\vartheta(\Theta \wedge X)$ with respect to $\pi(\theta)$. Hill and Spall (1987) and Spall and
Hill (1990) provided approximate solutions for the maximization problem.

Zellner (1971) defined an information function for quantifying the information in the data
$X$ about a parameter $\theta$ with the prior $\pi(\theta)$, which may be written as:

$$\begin{aligned} G[\pi(\theta)] &= E_\theta\{H[\pi(\theta)] - H[f(x|\theta)]\} \\ &= E_\theta\{K[f(x|\theta):\pi(\theta)]\} \\ &= \vartheta(\Theta \wedge X) + H(\Theta) - H(X). \end{aligned} \tag{2.14}$$

Zellner proposed $G[\pi(\theta)]$ as a criterion function for developing prior distributions that
are maximally committed to the data. The prior $\pi^*(\theta)$ that maximizes $G[\pi(\theta)]$ is referred to
as the Maximal Data Information Prior (MDIP). The first equation in (2.14) is the a priori
expected information in the data-generating density (likelihood function), which is purified
from the information in the prior. The second equation in (2.14) shows that $G[\pi(\theta)]$ is the a
priori expected information for discrimination between the data-generating distribution and
the prior. The third equation indicates that $G[\pi(\theta)]$ is a “broader” information criterion for
developing priors as compared with $\vartheta(\Theta \wedge X)$. Furthermore, MDIP gives explicit solutions in
many problems and is capable of including side information in terms of moment constraints
on $\pi(\theta)$; see Zellner (1991) and Soofi (1996) for details.
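
The link between the first and third expressions in (2.14) follows from the entropy form of the mutual information in (2.11). The short derivation below is a sketch that writes $H(\Theta)$ for $H[\pi(\theta)]$ and $H(X|\Theta)$ for $E_\theta\{H[f(x|\theta)]\}$; this shorthand is an assumption made here for the illustration.

```latex
\begin{aligned}
G[\pi(\theta)]
  &= H[\pi(\theta)] - E_\theta\{H[f(x|\theta)]\} \\
  &= H(\Theta) - H(X|\Theta) \\
  &= H(\Theta) - \left[H(X) - \vartheta(\Theta \wedge X)\right]
     \qquad \text{since } \vartheta(\Theta \wedge X) = H(X) - H(X|\Theta) \\
  &= \vartheta(\Theta \wedge X) + H(\Theta) - H(X).
\end{aligned}
```

The second expression is obtained in the same way by noting that $K[f(x|\theta):\pi(\theta)] = -H[f(x|\theta)] - \log\pi(\theta)$ and taking the expectation with respect to $\pi(\theta)$.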


3  ME  Distributions  for  Regression 

In  this  section,  ME  distributions  for  the  error  terms,  coefficients,  and  precision  of  a  given 
linear  regression  are  presented.  An  ME  procedure  for  derivation  of  regression  function  is  also 
briefly  discussed. 
3.1  ME  Distributions  for  Linear  Regression

Consider the linear equation:

$$y = X\beta + \varepsilon, \tag{3.1}$$

where $y$ is the $n \times 1$ vector of observations, $X$ is an $n \times p$ full rank matrix of given regressors,
$\beta$ is the $p \times 1$ vector of regression coefficients, and $\varepsilon$ is the $n \times 1$ vector of error terms.

In order to obtain an ME distribution for the error term for inferential purposes, we need
to specify a variation function, $V(\varepsilon) \ge 0$, for the error process. The maximum mean value
of the variation function, $v$, signifies the degree of accuracy and its inverse, $\varphi = v^{-1}$, signifies
the precision of a specified regression in (3.1).

Table 1 gives examples of variation functions and the corresponding ME error
distributions obtained using (2.3). As shown in Table 1, the square error variation gives the normal
distribution and the absolute error variation leads to the Laplace (double exponential) error
distribution, which has a heavier tail than the normal. The logarithmic variation gives the
generalized Student-t distribution for the error terms (Soofi 1996); the term generalized refers
to the fact that the degrees of freedom parameter $\nu$ may not be an integer for all precision
values $v(\nu)$. For regression analysis with the Student-t error distribution see Zellner (1976)
and Lange, Little, and Taylor (1989).


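Table 1. Variation Functions and Maximum Entropy Distributions for Regression Error

| Variation Function | Max $E[V(\varepsilon)]$ | ME Error Distribution |
|---|---|---|
| $V(\varepsilon) = \varepsilon_i^2$, $i = 1,\ldots,n$ | $\sigma^2$ | Normal: $f^*(\varepsilon_i) = (2\pi\sigma^2)^{-1/2}e^{-\varepsilon_i^2/(2\sigma^2)}$ |
| $V(\varepsilon) = \lvert\varepsilon_i\rvert$, $i = 1,\ldots,n$ | $a$ | Laplace: $f^*(\varepsilon_i) = (2a)^{-1}e^{-\lvert\varepsilon_i\rvert/a}$ |
| $V(\varepsilon) = \log(\nu + \varepsilon_i^2)$, $\nu \ge 1$, $i = 1,\ldots,n$ | $v(\nu) = \log\nu + \psi\left(\frac{\nu+1}{2}\right) - \psi\left(\frac{\nu}{2}\right)$ | Generalized $t$: $f^*(\varepsilon_i) = B\left(\frac{1}{2},\frac{\nu}{2}\right)^{-1}\nu^{-1/2}\left(1 + \frac{\varepsilon_i^2}{\nu}\right)^{-(\nu+1)/2}$ |
| $V(\varepsilon) = \varepsilon\varepsilon'$ | $\Sigma$ | Normal: $f^*(\varepsilon) = (2\pi)^{-n/2}\lvert\Sigma\rvert^{-1/2}e^{-\frac{1}{2}\varepsilon'\Sigma^{-1}\varepsilon}$ |

Notes: $\psi(u) = d\log\Gamma(u)/du$, $\Gamma$ is the gamma function; $B(u,v)$ is the Beta function.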
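
The ME property behind the first two rows of Table 1 can be illustrated by comparing the normal and Laplace entropies under each variation function in turn. The sketch below uses the closed-form entropies of the two families with an arbitrary bound of 1 for each constraint; the function names are illustrative and the comparison is restricted to these two families, so it is a spot check rather than a proof.

```python
import math

def normal_entropy(sigma):
    """Differential entropy of N(0, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def laplace_entropy(scale):
    """Differential entropy of the Laplace density with the given scale."""
    return 1.0 + math.log(2.0 * scale)

# Square error variation: both densities satisfy E[e^2] = 1.
h_normal_sq = normal_entropy(1.0)
h_laplace_sq = laplace_entropy(math.sqrt(0.5))        # Laplace variance is 2 * scale^2

# Absolute error variation: both densities satisfy E|e| = 1.
h_laplace_abs = laplace_entropy(1.0)                  # Laplace: E|e| = scale
h_normal_abs = normal_entropy(math.sqrt(math.pi / 2)) # normal: E|e| = sigma * sqrt(2/pi)

print(h_normal_sq, h_laplace_sq)    # the normal has the larger entropy
print(h_laplace_abs, h_normal_abs)  # the Laplace has the larger entropy
```

Each family attains the larger entropy under its own variation function, which is what the ME characterization predicts.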
Diagnostics for assessing suitability of a variation function as the description of the
error-generating distribution may be developed using the ID distinguishability index (2.6) along
the lines of Soofi, Ebrahimi, and Habibullah (1995).

Suppose that we use the square error variation

$$E_f(\varepsilon_i^2) \le \sigma^2, \quad i = 1,\ldots,n. \tag{3.2}$$

Then the ME procedure gives the following multivariate normal model for the vector of error terms:

$$f^*(\varepsilon;\sigma^2) = N(0,\sigma^2 I_n), \tag{3.3}$$

where $I_n$ denotes the identity matrix of dimension $n$.

The independence among the components of $\varepsilon$ in (3.3) is the result of considering solely
the marginal variations (3.2) as the information constraints in the ME computation. If
information is available about the interrelationship between the error components, it
should be taken into account by formulating appropriate covariation functions (cross-product
moment constraints). The ME solution for the error distribution subject to covariation
functions and/or nonhomogeneous maximum average variations across the $n$ dimensions will
be $N(0,\Sigma)$, as given by (2.4).

Given that $\beta$ and $X$ in (3.1) are not subject to variation, the ME error distribution (3.3)
gives the conventional normal regression model

$$f^*(y;\beta,\sigma^2) = N(X\beta,\sigma^2 I_n). \tag{3.4}$$

In  the  ME  procedure,  the  simpler  moment  assumption  (3.2)  replaces  the  more  stringent 
assumption  of  normality  usually  made  in  the  traditional  regression  analysis.  But  as  we  have 
seen,  the  ME  procedure  is  versatile  in  producing  more  general  regression  models. 

Let us now consider variation of $\beta$ in (3.1). Table 2 shows examples of the variation
functions for the regression coefficients $\beta_j$ around the arbitrary constants $m_j$, $j = 1,\ldots,p$.

If we only incorporate the range of variation of the regression coefficients, then the ME
solution is a uniform distribution, which is improper when $a = b = \infty$.

For the quadratic variation function $V(\beta_j) = (\beta_j - m_j)^2$, the information constraints are
$E(\beta_j - m_j)^2 \le \tau^2$, $j = 1,\ldots,p$. These constraints give the ME distribution

$$f^*(\beta;m,\tau^2) = N(m,\tau^2 I_p), \tag{3.5}$$
where $m = (m_1,\ldots,m_p)'$.

The ME distribution (3.5) is the conjugate prior frequently used in Bayesian analysis for
$\beta|\sigma^2$ of the likelihood function (3.4).

The classical random effects model is obtained with the combination of (3.4) and (3.5)
when $m_j = 0$ for all $j = 1,\ldots,p$. The ME distributions such as (3.5) developed for $\beta$ are
also useful for modeling heterogeneity of the regression coefficients among a population of
interest, which is an important concern in some fields such as marketing.

As in the case of the error distribution, failing to incorporate covariation information in
the ME computation results in prior independence among the coefficients. In order to
incorporate a covariance structure $\Psi$ as the prior information, we use
$E(\beta - m)(\beta - m)' = \tau^2\Psi$ as the constraint in the ME computation. This constraint gives the multivariate
normal prior shown in the last row of Table 2.

For example, if we wish to use the data covariance structure $\Psi = (X'X)^{-1}$ and $\tau^2 \propto \sigma^2$,
then we obtain

$$\pi(\beta;m,\tau^2) = N[m,\tau^2(X'X)^{-1}], \tag{3.6}$$

which is the $g$-prior proposed by Zellner (1982).


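Table 2. Variation Functions and Maximum Entropy Distributions for Regression Coefficients.

| Variation Function | Max $E[V(\beta)]$ | ME Distribution for Coefficient |
|---|---|---|
| $V(\beta_j) = \delta(m_j - a < \beta_j < m_j + b)$, $j = 1,\ldots,p$ | 1 | Uniform: $f^*(\beta) = (a+b)^{-p}$ |
| $V(\beta_j) = (\beta_j - m_j)^2$, $j = 1,\ldots,p$ | $\tau^2$ | Normal: $f^*(\beta) = (2\pi\tau^2)^{-p/2}e^{-\frac{1}{2\tau^2}(\beta-m)'(\beta-m)}$ |
| $V(\beta) = (\beta - m)(\beta - m)'$ | $\tau^2\Psi$ | Normal: $f^*(\beta) = (2\pi\tau^2)^{-p/2}\lvert\Psi\rvert^{-1/2}e^{-\frac{1}{2\tau^2}(\beta-m)'\Psi^{-1}(\beta-m)}$ |

Note: $\delta(\cdot)$ is the indicator function: $\delta(m_j - a < \beta_j < m_j + b) = 1$ if $\beta_j \in (m_j - a, m_j + b)$, $j = 1,\ldots,p$, and 0 otherwise.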




Next, I incorporate variation of the precision parameter $\varphi = \sigma^{-2}$ in (3.4). Because $\varphi$ is
positive with probability one, we can consider the types of information constraints shown in
Table 3 and obtain the corresponding ME distributions. Except for $\log\varphi$, the constraints
shown in Table 3 may be interpreted as variation functions; $\log\varphi$ may also be interpreted as a
variation function if $P(\varphi \ge 1) = 1$.

As in the case of the regression coefficients, the ME distributions derived for the
precision parameter are useful in Bayesian and frequentist analyses. The uniform distribution
for $\log\varphi$ is the Jeffreys prior. The first three ME distributions shown in Table 3 are
special or limiting cases of the Gamma distribution, which is ME using a pair of information
constraints. In Bayesian analysis, the Gamma distribution is the conjugate prior for the normal
regression model (3.4). The Gamma distribution is also used in frequentist analyses for
modeling heterogeneity of the regression precision among individuals.


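Table 3. Information Constraints and Maximum Entropy Distributions for Regression Precision.

Information Constraint | Max $E[c(\varphi)]$ | ME Distribution for Precision

$c(\varphi) = \delta(a < \log\varphi < b)$ | 1 | Uniform (in $\log\varphi$): $f^* = (b-a)^{-1}$

$c(\varphi) = \varphi$ | $a$ | Exponential: $f^*(\varphi) = \frac{1}{a} e^{-\varphi/a}$

$c(\varphi) = \log\varphi$, $\varphi > a$ | [...] | Pareto: $f^*(\varphi) = $ [...]

$c_1(\varphi) = \varphi$, $c_2(\varphi) = \log\varphi$ | [...] | Gamma: $f^*(\varphi) = $ [...]

Notes: $\psi(u) = \Gamma'(u)/\Gamma(u)$, where $\Gamma$ is the gamma function. $\delta(\cdot)$ is the indicator function: $\delta(a < \log\varphi < b) = 1$ if $\log\varphi \in (a, b)$, and $0$ otherwise.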
3.2 ME Regression Functions

A regression function is defined by the conditional expectation and is given by:

$$ y(x) = E(Y \mid x) = \int y f(y \mid x)\,dy = \frac{\int y f(y, x)\,dy}{\int f(y, x)\,dy}, $$

where $x = (x_1, \dots, x_p)$ is the vector of regressors, assumed to be subject to variation. Thus, in principle, one can find the ME joint distribution $f^*(y, x)$ that satisfies a set of information moment constraints, and then find the conditional expectation $y(x)$.
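To make the ratio-of-integrals form concrete, the sketch below evaluates it numerically for one familiar joint distribution: a bivariate normal with zero means, unit variances, and correlation `rho`, for which the regression function is known to be `rho * x`. The choice of this joint density, the integration limits, and the step count are assumptions made only for this illustration.

```python
import math

def joint_density(y, x, rho):
    """Bivariate normal density: zero means, unit variances, correlation rho."""
    norm = 1.0 / (2.0 * math.pi * math.sqrt(1.0 - rho ** 2))
    quad = (y * y - 2.0 * rho * x * y + x * x) / (1.0 - rho ** 2)
    return norm * math.exp(-0.5 * quad)

def regression_function(x, rho, lo=-10.0, hi=10.0, steps=4000):
    """Approximate int y f(y,x) dy / int f(y,x) dy with a Riemann sum."""
    h = (hi - lo) / steps
    num = 0.0
    den = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        f = joint_density(y, x, rho)
        num += y * f
        den += f
    return num / den  # the step width h cancels in the ratio

value = regression_function(1.5, 0.6)
print(value)  # close to rho * x
```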

Ryu (1993) considered the special case when $y(x) > 0$ and noted that $y(x)$ is an averaged density with respect to $f(x)\,dx$. Ryu showed that many well-known regression functions can be derived as solutions to

$$ \max_{y} \; -\int y(x) \log[y(x)]\, f(x)\,dx $$


subject  to  constraints 


■  ’  (^mp{^p)y{^)  f  {^)dx  — 


4  Regression  Estimation  and  Prediction 

In this section I discuss information theoretic procedures for estimating regression coefficients and developing forecast distributions.

4.1  MDI  Estimation 

We have a set of observations $y_1, \dots, y_n$ generated from an $n$-variate distribution $f(y)$. Our objective is to estimate $\theta$ of the ME distribution $f^*(y; \theta)$ implied by the linear relationship (3.1) and the associated variation function $V(\varepsilon)$. Here $\theta$ denotes the vector of $p$ coefficients and all the parameters related to $V(\varepsilon)$.

We explicitly differentiate between a convenient mathematical function $f(y; \theta)$, termed a model, which we utilize in practice as an approximation, and the unknown true data-generating distribution $f(y)$. "It may be very likely that the true distribution is in fact too complicated to be represented by a simple mathematical function such as is given in ordinary textbooks" (Sawa 1978).

Given the linear relation (3.1), the variation function $V(\varepsilon)$, and $\theta$, we derive the ME distribution $f^*(y; \theta)$ and use it as our estimate for the parametric family of the model $f(y; \theta)$. The symbol $f^*$ underscores the fact that the ME model is being used only as an approximation for $f(y)$. According to the entropy concentration theorem (Jaynes 1982, van Campenhout and Cover 1981) and the ID distinguishability result of Soofi et al. (1995), the approximation should be satisfactory if $f$ is a "typical" distribution in the class for which $f^*$ is the ME model; i.e., if $ID(f : f^*; \theta) \approx 0$, so that $f$ is not ID distinguishable from $f^*$.

Therefore, it is natural to estimate the model parameter $\theta$ based on a criterion that improves the model approximation for the data-generating distribution. The MDI or minimum relative entropy loss estimation procedure serves this purpose. [For MDI estimation in other contexts see, e.g., Kullback (1959), James and Stein (1961), Gokhale and Kullback (1978), Haff (1980), Ghosh and Yang (1988), and Soofi and Gokhale (1991a).]

The loss of approximating $f(y)$ by an ME model $f^*(y; \theta)$ with its parameter estimated by $\tilde\theta$ is measured by

$$ \frac{2}{n} K[f(y) : f^*(y; \tilde\theta)]. $$

The MDI or minimum relative entropy loss estimate $\tilde\theta_{MDI}$ of the parameter $\theta$ is defined by:

$$ \tilde\theta_{MDI} = \arg\min_{\theta} K[f(y) : f^*(y; \theta)]. $$

The Bayesian risk of approximating $f(y)$ by an estimated ME model $f^*(y; \tilde\theta)$ is computed by the posterior expectation $\frac{2}{n} E\{K[f(y) : f^*(y; \tilde\theta)] \mid y\}$. The MDI Bayes (MDIB) estimate of $\theta$ is defined by

$$ \tilde\theta_{MDIB} = \arg\min_{\tilde\theta} E\{K[f(y) : f^*(y; \tilde\theta)] \mid y\}. $$

The frequentist risk of approximation is found by computing the expected loss with respect to the sampling distribution, $\frac{2}{n} E_\theta\{K[f(y) : f^*(y; \tilde\theta)]\}$.

Decomposing the log-ratio in (2.5) gives

$$ \frac{2}{n} K[f(y) : f^*(y; \tilde\theta)] = 2 H_f[f^*(y; \tilde\theta)] - 2 H[f(y)], \qquad (4.1) $$

where

$$ H_f[f^*(y; \tilde\theta)] = -\frac{1}{n} E_f[\log f^*(y; \tilde\theta)]. \qquad (4.2) $$

The entropy of the data-generating distribution is free of $\theta$, so $H[f(y)]$ in (4.1) is sometimes ignored, the loss is measured by the expected log-likelihood (4.2), and the MDI estimate may be obtained by

$$ \tilde\theta_{MDI} = \arg\max_{\theta} E_f[\log f^*(y; \theta)]. \qquad (4.3) $$

Here the cumbersome problem of minimizing the information discrepancy between the unknown data-generating distribution and the ME model is reduced to the simpler problem of maximizing the expected value of a log-likelihood function. But unlike conventional statistics, in which the parameters are often estimated solely by considering a postulated model, MDI estimation involves both the model and the data-generating distribution. However, at this point the problem is not yet completely solved.

Akaike (1973) proceeded with an MDI parameter estimation by first estimating the expectation in (4.3) using the empirical distribution, which assigns a mass of $\frac{1}{n}$ to each data point $y_i$, $i = 1, \dots, n$. In this case,

$$ \tilde\theta_{MDI} = \arg\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log f^*(y_i; \theta) = \hat\theta, $$

where $\hat\theta$ is the Maximum Likelihood Estimate (MLE) of $\theta$ under the model $f^*$. Thus from an information theoretic viewpoint, the MLE minimizes an estimated information discrepancy between the data-generating distribution $f(y)$ and the ME distribution $f^*$. Akaike interpreted this approach as an extension of the MLE principle.
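A small numerical check of this equivalence for the normal ME regression model is sketched below. The simulated data (deliberately drawn with non-normal errors), the seed, and the function name `mean_loglik` are assumptions of the example, not part of the original development.

```python
import numpy as np

def mean_loglik(y, X, beta, sigma2):
    """Empirical average (1/n) sum_i log f*(y_i; beta, sigma2) under N(X beta, sigma2 I)."""
    resid = y - X @ beta
    n = y.size
    return -0.5 * np.log(2.0 * np.pi * sigma2) - (resid @ resid) / (2.0 * n * sigma2)

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# The data-generating distribution need not be normal: Student-t errors here.
y = X @ np.array([1.0, 2.0]) + rng.standard_t(df=5, size=n)

# MLE under the normal model: least squares coefficients and RSS / n.
b = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ b) ** 2) / n

best = mean_loglik(y, X, b, sigma2_hat)
# Moving away from the MLE lowers the estimated expected log-likelihood.
print(best, mean_loglik(y, X, b + np.array([0.1, 0.0]), sigma2_hat))
```

At the MLE the average log-likelihood equals $-\frac{1}{2}[\log(2\pi\hat\sigma^2) + 1]$, so the estimated discrepancy depends on the data only through $\hat\sigma^2$.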

Under the assumption that the data-generating distribution is also in the same parametric family as $f^*(y; \theta)$, Akaike (1973) estimated the information discrepancy for $\theta$ by

KAlriV,  9)  :  /•(!/;  9)1  =  (4.4) 
The consistency of the MLE implies that (4.4) is a consistent estimate of $K[f^*(y; \theta) : f^*(y; \hat\theta)]$. Akaike computed an approximate frequentist risk function for the purpose of model selection, which will be discussed in Section 5.2.

Consider estimation of the normal ME regression model, $f^*(y; \beta, \sigma^2) = N(X\beta, \sigma^2 I_n)$.

The normal model is plausible for approximating the data-generating distribution when $f(y)$ possesses at least the first two moments, say, $\mu$ and $\Omega$; i.e., $f(y) = f(y; \mu, \Omega)$. Note that the normal ME regression model uses the specific form $\mu = X\beta$, but the data-generating distribution $f(y) = f(y; \mu, \Omega)$ is not as restricted.

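The loss of approximating $f(y; \mu, \Omega)$ by the normal regression model with its parameters estimated by $\tilde\beta$ and $\tilde\sigma^2$ is given by

$$ \frac{2}{n} K[f(y; \mu, \Omega) : f^*(y; \tilde\beta, \tilde\sigma^2)] = 2 H_f[f^*(y; \tilde\beta, \tilde\sigma^2)] - 2 H[f(y; \mu, \Omega)]. $$

For the case of $\Omega = \omega^2 I_n$, the expected log-likelihood can be evaluated and the estimation loss is given by

$$ -2 H[f(y; \mu, \omega^2)] + \log(2\pi) + \log\tilde\sigma^2 + \frac{\omega^2}{\tilde\sigma^2} + \frac{(\mu - X\tilde\beta)'(\mu - X\tilde\beta)}{n\tilde\sigma^2}. \qquad (4.5) $$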


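Various Bayesian schemes have been suggested for finding the risk of approximating $f(y; \mu, \omega^2 I_n)$ by an estimated normal regression model. Leamer (1979) used the posterior distribution of $(\mu, \omega^2)$. Expanding the quadratic form in (4.5) and taking expectation gives

$$ E\left\{\frac{2}{n} K[f(y; \mu, \omega^2) : f^*(y; \tilde\beta, \tilde\sigma^2)] \,\Big|\, y\right\} = -2 E\{H[f(y; \mu, \omega^2)] \mid y\} + \log(2\pi) + \log\tilde\sigma^2 + \frac{E(\omega^2 \mid y)}{\tilde\sigma^2} + \frac{E(\mu'\mu \mid y)}{n\tilde\sigma^2} + \frac{1}{n\tilde\sigma^2}\left[\tilde\beta' X'X \tilde\beta - 2 E(\mu' \mid y) X\tilde\beta\right]. \qquad (4.6) $$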

The MDIB estimates that minimize the expected loss (4.6) with respect to $\tilde\beta$ and $\tilde\sigma^2$ are:

$$ \tilde\beta_{MDIB} = (X'X)^{-1} X' E(\mu \mid y) \qquad (4.7) $$

$$ \tilde\sigma^2_{MDIB} = E(\omega^2 \mid y) + \frac{1}{n} E(\mu'\mu \mid y) - \frac{1}{n} E(\mu' \mid y) X (X'X)^{-1} X' E(\mu \mid y). \qquad (4.8) $$

In a given problem these MDIB estimates can be evaluated by using the unrestricted ME distribution $f^*(y; \mu, \omega^2) = N(\mu, \omega^2 I_n)$ and some types of ME priors for $(\mu, \omega^2)$. When the prior is weak, the MDIB estimates (4.7) and (4.8) are approximately equal to the MLE of $\beta$ and $\sigma^2$ under the normal ME model,

$$ \tilde\beta_{MDIB} \approx (X'X)^{-1} X'y \equiv b \qquad (4.9) $$

$$ \tilde\sigma^2_{MDIB} \approx \frac{1}{n} (y - Xb)'(y - Xb) = \hat\sigma^2. \qquad (4.10) $$

Sawa (1978) assumed that $f(y) = N(\mu, \omega^2 I_n)$, and defined a risk function in terms of the posterior distributions of the parameters of the normal regression model $f^*(y; \beta, \sigma^2)$. Using diffuse priors for $\beta$ and $\sigma^2$, he found that

>  109I.2T,)  +  loga^+  (’  +  ^) 

where $s^2$ is the mean square error of the least squares regression.

Young  (1987)  defined  a  risk  function  as  K[f{y,  yu)  :  /(2/;^,d^)];  see 

Section  5.2. 

Sawa (1978) also found an approximation for the frequentist risk of $\hat\beta$ and $\hat\sigma^2$. He showed that if the components of $y$ are symmetrically distributed with the same kurtosis as the normal distribution, then the frequentist risk of $\hat\beta$ and $\hat\sigma^2$ is approximately


K[f{y\y,u}) :  /(y;3,o-2)] 


2H[/(2/;//,u;)]  +Zo5(27re) 


+log  (Tq  + 


Pfc  +  1  (u‘^ 


n 


.4. 


1. 

n 


(4.11) 


The quantity $\sigma_0^2$ is defined below.

The solutions to the minimization of $K[f(y; \mu, \omega^2) : f^*(y; \beta, \sigma^2)]$ with respect to the model parameters $\beta$ and $\sigma^2$ are:

$$ \beta_0 = (X'X)^{-1} X'\mu $$

$$ \sigma_0^2 = \frac{1}{n} \mu'[I_n - X(X'X)^{-1}X']\mu + \omega^2. $$

These quantities are referred to as pseudo-true parameter values. The MLE of $\beta$ is unbiased for the pseudo-true parameter value $\beta_0$, and the MLE of $\sigma^2$ is asymptotically unbiased for the pseudo-true parameter value $\sigma_0^2$ (Sawa 1978).
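The pseudo-true values can be checked numerically, as in the sketch below: a linear model is fitted to a mean vector that is actually quadratic, and the scaled loss (without the entropy term, which does not involve the model parameters) is compared at and away from the pseudo-true values. The design, the mean vector `mu`, the value of `omega2`, and the function names are hypothetical.

```python
import numpy as np

def pseudo_true(X, mu, omega2):
    """Pseudo-true beta_0 and sigma_0^2 for a normal model fitted to mean mu, variance omega2."""
    n = X.shape[0]
    beta0 = np.linalg.solve(X.T @ X, X.T @ mu)
    resid = mu - X @ beta0                 # equals [I - X(X'X)^{-1}X'] mu
    sigma02 = resid @ resid / n + omega2
    return beta0, sigma02

def scaled_loss(X, mu, omega2, beta, sigma2):
    """(2/n) K[f : f*] up to the entropy term, which is free of (beta, sigma2)."""
    n = X.shape[0]
    d = mu - X @ beta
    return np.log(2.0 * np.pi) + np.log(sigma2) + omega2 / sigma2 + (d @ d) / (n * sigma2)

t = np.arange(5.0)
X = np.column_stack([np.ones(5), t])   # misspecified linear model
mu = t ** 2                            # hypothetical true mean
omega2 = 1.0                           # hypothetical true variance

beta0, sigma02 = pseudo_true(X, mu, omega2)
base = scaled_loss(X, mu, omega2, beta0, sigma02)
print(beta0, sigma02, base)
```

The second component of $\sigma_0^2$ absorbs the part of the mean that the linear model cannot represent, which is why it exceeds $\omega^2$ here.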
4.2 MDI Method of Moments

The MDI moment (MDIM) estimate of $\theta$ is defined by the solution to the following constrained relative entropy loss problem:

$$ \min_{\theta} K[f(y) : f^*(y; \theta)] $$

subject to

$$ \int T_m(y) f(y)\,dy = \bar{T}_m(y), \quad m = 1, \dots, M, $$

where $\bar{T}_m(y)$ is a sample moment of interest.

As a specific example, we construct MDIM estimates for the parameters of the normal regression model $f^*(y; \beta, \sigma^2) = N(X\beta, \sigma^2 I_n)$.

Suppose that our data consist of

$$ y_{1k_1}, \dots, y_{nk_n}, \quad k_i = 1, \dots, n_i \ge 1, \quad i = 1, \dots, n. $$

The MDIM estimates of $\beta$ and $\sigma^2$ are found by the solutions of:

$$ \min_{\beta, \sigma^2} K[f(y) : f^*(y; \beta, \sigma^2)] \qquad (4.12) $$

subject to

$$ \int y_i f(y)\,dy = \bar{y}_i, \quad i = 1, \dots, n, \qquad (4.13) $$

$$ \int (y_i - \bar{y}_i)^2 f(y)\,dy = s_i^2, \quad i = 1, \dots, n, \qquad (4.14) $$

where

$$ \bar{y}_i = \frac{1}{n_i} \sum_{k_i=1}^{n_i} y_{ik_i}, \qquad s_i^2 = \frac{1}{n_i} \sum_{k_i=1}^{n_i} (y_{ik_i} - \bar{y}_i)^2. $$

Assuming $f(y)$ satisfies the regularity conditions required for taking the derivative inside the integral sign, the MDIM estimates are found as:

$$ \tilde\beta_{MDIM} = (X'X)^{-1} X'\bar{y} \qquad (4.15) $$

$$ \tilde\sigma^2_{MDIM} = \frac{1}{n} \bar{y}'[I_n - X(X'X)^{-1}X']\bar{y} + \frac{1}{n} \sum_{i=1}^{n} s_i^2, \qquad (4.16) $$

where $\bar{y} = (\bar{y}_1, \dots, \bar{y}_n)'$. The MDI estimate (4.15) was introduced in Soofi (1985).

In the MDIM procedure, the unknown values $\mu_i$ are estimated by $\bar{y}_i$ using constraint (4.13). This is in line with the common practice of using the regression estimate of $y_i$ as the point estimate of the conditional expected value corresponding to $x_i$. Then, the mean variation function in each dimension is estimated by $s_i^2$ using constraint (4.14). Finally, the information discrepancy between the unknown data-generating distribution and the ME regression model is minimized.

The results obtained using the MDIM procedure are akin to those obtained using the conventional techniques. The two components in the MDIM estimate of the error variance are related to well-known quantities in regression analysis. The first term in (4.16) is the component of variance due to lack of fit of the regression (3.1) to the data, and the second term is the component of variance due to pure error.

For the case of a single observation per dimension, $n_i = 1$, $\bar{y}_i = y_i$, $s_i^2 = 0$, the MDIM estimates (4.15) and (4.16) are equivalent to the usual MLE of $\beta$ and $\sigma^2$. Thus, the MDIM estimates of the parameters of the normal regression model possess all the properties of the MLE.
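The two variance components can be computed directly from replicated data, as the following sketch shows. The design matrix and the replicate groups are hypothetical values chosen so the arithmetic is easy to follow, and `mdim_estimates` is an illustrative name only.

```python
import numpy as np

def mdim_estimates(X, groups):
    """MDIM estimates for the normal regression model with replicates.

    groups[i] holds the n_i responses observed at row i of X.
    Returns (beta, sigma2, lack_of_fit, pure_error).
    """
    n = X.shape[0]
    ybar = np.array([np.mean(g) for g in groups])
    s2 = np.array([np.mean((np.asarray(g) - np.mean(g)) ** 2) for g in groups])
    beta = np.linalg.solve(X.T @ X, X.T @ ybar)
    resid = ybar - X @ beta
    lack_of_fit = resid @ resid / n   # first term of the variance estimate
    pure_error = s2.sum() / n         # second term of the variance estimate
    return beta, lack_of_fit + pure_error, lack_of_fit, pure_error

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
groups = [[1.0, 3.0], [2.0], [5.0, 7.0, 6.0]]   # n_1 = 2, n_2 = 1, n_3 = 3

beta, sigma2, lack_of_fit, pure_error = mdim_estimates(X, groups)
print(beta, sigma2, lack_of_fit, pure_error)
```

With one observation per row the pure-error component vanishes and the function returns the usual MLE, which matches the single-observation remark above.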

The relationship between the MDIM and MLE is similar to a duality that exists between the MLE and the Internal Constraint Problem (ICP) formulation of Gokhale and Kullback (1978), an estimation method extensively used in the information-theoretic analysis of contingency tables. In the ICP formulation, the discrimination information function between an unknown distribution $f$ and a known distribution $g$, $K(f : g)$, is minimized with respect to $f$ subject to constraints (2.3) with the information moment values $\theta_j$ obtained from the data. When $g$ is uniform, the MLE of $f^*$ and the MDI estimate of $f$ are equivalent. The above MDIM procedure is similar to ICP in that the constraints use the data moments, but in (4.12), the reference distribution is not completely known and is not uniform. When the reference distribution is not uniform, the equivalence with MLE is problem-specific and thus may not always hold.

The MDIM procedure is also in line with the approach of Sawa (1978). In the MDIM procedure, $\beta$ and $\sigma^2$ are estimated directly, instead of first developing the MDI model with the pseudo-true parameters and then estimating them by the MLE.




The usual frequentist inference can be done using the sampling properties of the MDIM estimates (4.15) and (4.16) under the normal ME model $f^*(y; \beta, \sigma^2) = N(X\beta, \sigma^2 I_n)$. The usual Bayesian inference can be done using the normal ME likelihood function, selecting a prior for $\beta$ from the ME distributions in Table 2, and selecting a prior for the precision parameter from the ME distributions shown in Table 3.

Application  of  MDIM  to  other  ME  error  distributions  such  as  those  shown  in  Table  1 
will  lead  to  new  regression  analyses.  The  use  of  other  ME  error  distributions  as  likelihood 
functions,  and  other  ME  priors  will  lead  to  new  Bayesian  regression  analysis. 

4.3  An  MDI  Predictive  Density 

Let $y_N[X_m]$ be an $m \times 1$ vector of forecasts corresponding to the $m$ new vectors of explanatory variables arranged in the rows of $X_m$. The normal ME regression model (3.4) implies the ante-data forecast distribution

$$ f^*(y_N[X_m]; \beta, \sigma^2) = N(X_m\beta, \sigma^2 I_m). \qquad (4.17) $$

Since the parameters are unknown, the ante-data forecast distribution (4.17) is not usable. Several frequentist and Bayesian procedures for developing predictive distributions free of unknown parameters are available; see Geisser (1993).

Many  of  the  known  Bayesian  and  frequentist  predictive  distributions  are  in  the  class 

S  =  L(yA,lA:j|0) :  s(yA,lx„||D)  =  k 

where $g(\cdot)$ is a density, $h(\cdot)$ is a scalar function, and $D$ refers to the observed data $(X, y)$; see Levy and Perng (1986) and Keyes and Levy (1996).

Levy and Perng (1986) considered the following minimization of the expected discrimination information function:

$$ \min_{g \in \mathcal{G}} E_y\left[K\left(f^*(y_N[X_m]) : g(y_N[X_m] \mid y)\right)\right], \qquad (4.18) $$

where the expectation is with respect to the normal ME model $f^*(y; \beta, \sigma^2) = N(X\beta, \sigma^2 I_n)$ and $f^*(y_N[X_m])$ is the ante-data forecast distribution (4.17). The solution is the $m$-dimensional Student-t distribution with $n - p$ degrees of freedom,

$$ g^*(y_N[X_m] \mid D) = t_{n-p}\left(X_m b,\; \hat\sigma^2[I_m + X_m(X'X)^{-1}X_m']\right), \qquad (4.19) $$

where $X_m b$ is the location parameter and $\hat\sigma^2[I_m + X_m(X'X)^{-1}X_m']$ is the dispersion matrix. Keyes and Levy (1996) extended this result to multivariate linear models.

The predictive distribution (4.19) is the one obtained in Bayesian regression based on the Jeffreys prior $\pi(\beta, \sigma^2) \propto 1/\sigma^2$ and the normal ME likelihood. This coincidence is due to the fact that in (4.18) the objective is to find the predictive density which, on average, has the least information discrepancy with the ante-data ME density $f^*(y_N[X_m])$. That is, in (4.18), we search for the member of $\mathcal{G}$ which is closest to the least informative ante-data density, and we find the one based on the Jeffreys non-informative prior as the solution.
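The location and dispersion of such a Student-t predictive distribution are simple to compute. The sketch below assumes the scale factor is the residual mean square RSS/(n - p), which is the standard choice in the Jeffreys-prior analysis; the data, the new design rows, and the function name are hypothetical.

```python
import numpy as np

def predictive_t_parameters(X, y, X_new):
    """Location, dispersion matrix, and degrees of freedom of the predictive Student-t.

    Assumes the scale factor is RSS / (n - p).
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                       # least squares coefficients
    resid = y - X @ b
    s2 = resid @ resid / (n - p)                # residual mean square
    location = X_new @ b
    m = X_new.shape[0]
    dispersion = s2 * (np.eye(m) + X_new @ XtX_inv @ X_new.T)
    return location, dispersion, n - p

t = np.arange(5.0)
X = np.column_stack([np.ones(5), t])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
X_new = np.array([[1.0, 5.0],
                  [1.0, 6.0]])

location, dispersion, df = predictive_t_parameters(X, y, X_new)
print(location, dispersion, df)
```

The dispersion grows as the new rows move away from the centre of the observed design, which is the usual extrapolation penalty.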

4.4  Bayesian  Method  of  Moments 

The Bayesian Method of Moments (BMOM), recently proposed by Zellner (1994), combines the use of sample moments with the ME procedure. It joins least squares with the ME procedure and produces posterior (conditional on data) results without the need to introduce likelihood functions and prior densities; i.e., the BMOM bypasses Bayes' Theorem.

Zellner considered the linear equation (3.1) in which $y$, $X$ and $\beta$ are defined as before, and $\varepsilon = u$ is the vector of the realized error terms. The data $D = (y, X)$ are given, thus the quantities $X$ and $y$ are not subject to variation. But the quantities $\beta$ and $u$ are unknown and subject to variation.

The posterior means of $\beta$ and $u$ are obtained based on the first moment assumption of BMOM:

$$ X'E(u \mid D) = 0. \qquad (4.20) $$

Note that if (3.1) includes an intercept term, then $E(\bar{u} \mid D) = n^{-1} \sum_{i=1}^{n} E(u_i \mid D) = 0$. Taking the expectation of $\beta$ and $u$ in (3.1) gives

$$ E(\beta \mid D) = (X'X)^{-1}X'y \equiv b \qquad (4.21) $$

$$ E(u \mid D) = y - Xb \equiv \hat{u}, \qquad (4.22) $$

where $\hat{u}$ is the vector of least squares residuals.

The use of (4.21), (4.22), and (4.20) gives

$$ Var(\beta \mid D) = E[(\beta - b)(\beta - b)'] = (X'X)^{-1}X'E[(u - \hat{u})(u - \hat{u})']X(X'X)^{-1}, \qquad (4.23) $$

where the covariance structure of $u$ is a solution to the functional equation,

$$ E[(u - \hat{u})(u - \hat{u})'] = X(X'X)^{-1}X'E[(u - \hat{u})(u - \hat{u})']X(X'X)^{-1}X'. $$

Zellner proposed the following solution to the functional equation as the second moment assumption of BMOM:

$$ Var(u \mid D, \sigma^2) = E[(u - \hat{u})(u - \hat{u})' \mid \sigma^2] = X(X'X)^{-1}X'\sigma^2. \qquad (4.24) $$

Using (4.24) in (4.23) gives the posterior covariance matrix of $\beta$ conditional on $\sigma^2$ as

$$ Var(\beta \mid D, \sigma^2) = (X'X)^{-1}\sigma^2. $$

When (3.1) includes an intercept term, some algebraic manipulation of (4.24) gives the posterior expectation of $\sigma^2$ as the mean square error of the least squares regression,

$$ E(\sigma^2 \mid D) = s^2 = \frac{\hat{u}'\hat{u}}{n - p}. $$

The forecast for a new vector $x_m$ is given by $y_N[x_m] = x_m'\beta + u_N$. Thus, as a function of $\beta$ and $u_N$, the forecast is subject to variation. Conditional on $\sigma^2$, the mean and variance of the forecast are:

$$ E(y_N[x_m] \mid D, \sigma^2) = x_m'b $$

$$ Var(y_N[x_m] \mid D, \sigma^2) = [1 + x_m'(X'X)^{-1}x_m]\sigma^2. $$
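The BMOM second-moment algebra can be verified numerically. The sketch below checks that a covariance proportional to the projection matrix $X(X'X)^{-1}X'$ solves the functional equation, and that it leads to the coefficient covariance $(X'X)^{-1}\sigma^2$ and the stated forecast variance. The simulated design, the seed, the value of `sigma2`, and the new regressor vector are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma2 = 0.7                         # conditioning value (assumed)

XtX_inv = np.linalg.inv(X.T @ X)
P = X @ XtX_inv @ X.T                # projection onto the column space of X
V_u = P * sigma2                     # second moment assumption for u

# P is idempotent, so V_u = P V_u P: the functional equation is satisfied.
solves_equation = np.allclose(V_u, P @ V_u @ P)

# Substituting V_u into the sandwich form gives (X'X)^{-1} sigma2.
V_beta = XtX_inv @ X.T @ V_u @ X @ XtX_inv

# Forecast variance for a hypothetical new regressor vector.
x_new = np.array([1.0, 0.5])
forecast_var = (1.0 + x_new @ XtX_inv @ x_new) * sigma2

print(solves_equation, V_beta, forecast_var)
```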

Posterior and predictive distributions of various quantities of interest for BMOM regression analysis are shown in Table 4. The normal and exponential distributions are obtained directly by the ME procedure based on the BMOM derived in terms of the data. The Laplace (Double Exponential) distributions are derived by integrating out $\sigma^2$ from the joint density given by the product of the respective normal conditional density and the exponential density for $\sigma^2 \mid D$.

Table 4. Maximum Entropy Posterior and Predictive Distributions for BMOM Regression.

Quantity | ME Distribution | Mean | Variance

$\beta \mid D, \sigma^2$ | Normal | $b$ | $(X'X)^{-1}\sigma^2$

$u_i \mid D, \sigma^2$ | Normal | $\hat{u}_i$ | $x_i'(X'X)^{-1}x_i\,\sigma^2$

$y_N[x_m] \mid D, \sigma^2$ | Normal | $x_m'b$ | $[1 + x_m'(X'X)^{-1}x_m]\sigma^2$

$\sigma^2 \mid D$ | Exponential | $s^2$ | $s^4$

$\beta_j \mid D$ | Laplace | $b_j$ | $s_{jj}$ = $(j,j)$th element of $(X'X)^{-1}s^2$

$\ell'\beta \mid D$ | Laplace | $\ell'b$ | $\ell'(X'X)^{-1}\ell\, s^2$, $\ell' = (\ell_1, \dots, \ell_p)$

$u_i \mid D$ | Laplace | $\hat{u}_i$ | $x_i'(X'X)^{-1}x_i\, s^2$

$y_N[x_m] \mid D$ | Laplace | $x_m'b$ | $[1 + x_m'(X'X)^{-1}x_m]s^2$

Notes: $D$ refers to the data $(X, y)$. The density of the Laplace (Double Exponential) distribution with mean $u$ and variance $\omega^2$ is:

$$ f(z \mid u, \omega) = \frac{1}{\sqrt{2}\,\omega} \exp\left\{-\frac{\sqrt{2}\,|z - u|}{\omega}\right\}. $$

The Laplace predictive distribution, derived based on BMOM, generally gives wider intervals than those obtained using the normal and the Student-t predictive distributions found in the conventional Bayesian and frequentist regression approaches.
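A quick way to see the difference in interval width is to compare central intervals of a Laplace and a normal distribution with the same mean and variance. The sketch below does this at the 95% level only, and only against the normal distribution; the function names are illustrative.

```python
import math
from statistics import NormalDist

def laplace_halfwidth(omega, level=0.95):
    """Half-width of the central interval of a Laplace law with variance omega**2."""
    b = omega / math.sqrt(2.0)           # Laplace scale: variance = 2 * b**2
    # P(|Z - u| <= h) = 1 - exp(-h / b), solved for h.
    return -b * math.log(1.0 - level)

def normal_halfwidth(omega, level=0.95):
    """Half-width of the central interval of a normal law with variance omega**2."""
    return omega * NormalDist().inv_cdf(0.5 + level / 2.0)

lap = laplace_halfwidth(1.0)
nor = normal_halfwidth(1.0)
print(lap, nor)   # the Laplace half-width is the larger of the two at this level
```

The ordering depends on the coverage level: the Laplace interval is wider at high levels such as 95%, while at low levels (for example 50%) the sharper peak of the Laplace density makes it narrower.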

4.5  ME  Estimation  With  Undersize  Sample 

An undersize sample refers to the situation when $n < p$ in the linear relationship (3.1). In the science and engineering fields related to image reconstruction, the problem is referred to as an ill-posed inverse problem, and the ME inversion method is available to solve it (Gull and Daniell 1978, Skilling and Bryan 1984, Gull 1989). In the ME inversion method, the image is a high dimensional vector of positive elements that are reconstructed based on the noisy data vector $y$, which has a dimension much lower than the rank of the linear operator $X$; i.e., $n \ll p$.

The ME inversion method solves the ill-posed problem using the following formulation:

$$ \max_{\beta} \; -\sum_{j=1}^{p} \beta_j \log(\beta_j) \quad \text{subject to} \quad (y - X\beta)'(y - X\beta) \le \theta. \qquad (4.25) $$

The ME estimate is given by

$$ \tilde\beta_{ME(\eta)} = \arg\min_{\beta} \sum_{j=1}^{p} \beta_j \log(\beta_j) + \eta_1 (y - X\beta)'(y - X\beta). \qquad (4.26) $$

The solution is found using an optimization routine. Skilling and Bryan (1984) have developed a special routine for the ME inversion method.

Strictly speaking, $-\sum_{j=1}^{p} \beta_j \log(\beta_j)$ in (4.25) is not a bona fide entropy because the normalizing constraint $\sum_{j=1}^{p} \beta_j = 1$ is not included. But this causes no problem since the solutions of (4.26) are positive and can be normalized if so desired.

In traditional terms, the solution to the dual of (4.25) is the following constrained (regularized) least squares estimate

$$ b_{ME(\eta)} = b_{LS(\eta)} = \arg\min_{\beta} (y - X\beta)'(y - X\beta) + \eta_2 \sum_{j=1}^{p} \beta_j \log(\beta_j). $$

The solution depends on the parameter $\eta = \eta(\theta)$, which may be chosen based on some statistical criteria such as cross-validation; for more detail see Donoho et al. (1992) and the discussions following that article.

Next I discuss two ME estimation methods, proposed by Theil and Laitinen (1980) and Vinod (1982), that avoid singularity of $X'X$ when $n < p$. These methods are based on viewing the rows of $X$ in (3.1) as samples from a $p$-dimensional random variable $x$.

Under the assumptions that $\beta$ is not subject to variation, $X$ is subject to variation, and $E(X'\varepsilon) = 0$, the regression coefficient is given by

$$ \beta = \Sigma_x^{-1}\sigma_{yx}, $$

where $\Sigma_x = [\sigma_{jk}]$ is the covariance matrix of the explanatory variables $x_1, \dots, x_p$ and $\sigma_{yx} = (\sigma_{y,1}, \dots, \sigma_{y,p})'$ is the vector of the covariances of $y$ with $x_k$, $k = 1, \dots, p$.

In the case of random regressors, $X'X$ and $X'y$ are estimates of the cross-product moments $E(xx')$ and $E(xy)$ obtained by the sample second order moments.

Theil and Laitinen (1980) proposed estimating $\Sigma_x$ by the second order moments of the ME distribution that they developed for $x$ by assuming that $F(x)$ is continuous. Let $x_j^1 < x_j^2 < \cdots < x_j^n$ denote the order statistics for the sample $x_{j1}, \dots, x_{jn}$, and let the intermediate points be defined as $x_j^i < \xi_j^i < x_j^{i+1}$, $i = 1, \dots, n-1$. The set of all intermediate points partitions $R^p$ into $n^p$ regions. The partitions are either bounded hyper-rectangular regions with sides given by the interval segments connecting the pairs of consecutive intermediate points $\xi_j^i$, $i = 1, \dots, n-1$, or semi-bounded hyper-rectangular regions that have open-ended intervals of types $(-\infty, \xi_j^1]$ and/or $[\xi_j^{n-1}, \infty)$ for a number of their sides. There are $n$ regions $R_1, \dots, R_n$ each containing one data point $x_i$, and all other regions are empty. Theil and Laitinen constructed the ME($\xi$) distribution of $x$ subject to the mass-preserving constraint

$$ \int_{R_i} f(x)\,dx = \frac{1}{n}, \quad i = 1, \dots, n, \qquad (4.27) $$

and the mean-preserving constraint

$$ \int x_j f(x)\,dx = \bar{x}_j, \quad j = 1, \dots, p. \qquad (4.28) $$

The constraints (4.27) and (4.28) produce a $p$-variate ME($\xi$) distribution $F^*$ with the following properties:

(i) The ME($\xi$) density $f^*(x) > 0$ if and only if $x \in R_i$ for an $i = 1, \dots, n$.

(ii) The ME($\xi$) distribution $F^*(x)$ is the product of piecewise uniform marginals when $x \in R_i$ and $R_i$ is a bounded hyper-rectangular region. $F^*(x)$ is the product of uniform and exponential marginals when $x \in R_i$ and $R_i$ is a semi-bounded hyper-rectangular region.

(iii) The mean-preserving constraint makes the intermediate points the primary midpoints $\xi_j^i = \frac{1}{2}(x_j^i + x_j^{i+1})$, $i = 1, \dots, n-1$. The means of the individual intervals are given by the secondary midpoints

$$ E(x_j \mid x \in R_i) = \tilde{x}_j^i = \begin{cases} \frac{3}{4}x_j^1 + \frac{1}{4}x_j^2 & \text{for } i = 1 \\ \frac{1}{4}x_j^{i-1} + \frac{1}{2}x_j^i + \frac{1}{4}x_j^{i+1} & \text{for } i = 2, \dots, n-1 \\ \frac{1}{4}x_j^{n-1} + \frac{3}{4}x_j^n & \text{for } i = n. \end{cases} $$

Thus the sample mean $\bar{x}_j$ is also the mean of the secondary midpoints, $\tilde{x}_j^1, \dots, \tilde{x}_j^n$.

(iv)  The  covariance  matrix  of  f*{x)  is  S*  with  elements  defined  by 

s*k  =  -f^{xji-xj){xki-xk)  (4.29) 

[Bei  -  + 2(ej  -  fi? + m  -  .  (4-30) 

where $\tilde{x}_{ji}$ are the secondary midpoints rearranged according to the original data index and $\delta_{jk}$ is the Kronecker delta. The ME($\xi$) variance $s^*_{jj}$ is smaller than the sample variance and the amount of shrinkage is given by


The ME($\xi$) covariance matrix may be written as $S^* = \tilde{S} + D_\xi$, where $\tilde{S}$ is the covariance matrix of the secondary midpoints, whose elements are given by the expression (4.29), and $D_\xi$ denotes the diagonal matrix with elements shown in the expression (4.30). The ME($\xi$) covariance $S^*$ is positive definite.

The ME($\xi$) estimate of the linear regression coefficient is

$$ b_{ME(\xi)} = S^{*-1}s^*_{yx} = (\tilde{S} + D_\xi)^{-1}s^*_{yx}, \qquad (4.31) $$

where $s^*_{yx}$ is the vector of ME($\xi$) covariances between $y$ and the explanatory variables computed using the secondary midpoints $\tilde{y}_i$ and $\tilde{x}_{ji}$. The last expression in (4.31) shows that $b_{ME(\xi)}$ is computed as a ridge estimate of the secondary midpoints. The ridge values are given by the secondary midpoints as shown in (4.30). Because of this ridge structure, $b_{ME(\xi)}$ can be computed for undersize samples.
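The secondary midpoints of property (iii) are easy to compute for a single variable. The sketch below uses the weights 3/4, 1/4 at the two ends and 1/4, 1/2, 1/4 in the interior, and checks that the sample mean is preserved. The sample values and the function name are hypothetical.

```python
def secondary_midpoints(sample):
    """Secondary midpoints of one variable, returned in order-statistic order."""
    xs = sorted(sample)
    n = len(xs)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mids = []
    for i in range(n):
        if i == 0:
            mids.append(0.75 * xs[0] + 0.25 * xs[1])
        elif i == n - 1:
            mids.append(0.25 * xs[n - 2] + 0.75 * xs[n - 1])
        else:
            mids.append(0.25 * xs[i - 1] + 0.5 * xs[i] + 0.25 * xs[i + 1])
    return mids

sample = [4.0, 1.0, 9.0, 2.0]
mids = secondary_midpoints(sample)
print(mids)                                  # [1.25, 2.25, 4.75, 7.75]
print(sum(mids) / len(mids), sum(sample) / len(sample))
```

The midpoints are pulled toward their neighbours, so their spread is smaller than that of the raw data while their mean is unchanged.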

Meisner (1980) compared the risks of $b_{ME(\xi)}$ and the ordinary least squares estimate $b$ under the quadratic loss when the data are generated from a multivariate normal distribution. Because of complications, he only considered $n = 2$ and $p = 2, 3$. For $p = 2$, $b_{ME(\xi)}$ compares favorably with $b$ over most of the parameter space. For $p = 3$, $b$ does not exist.

Vinod (1982) developed the ME($d$) estimate for the linear regression coefficient using the above procedure but relaxing the assumption of continuity for $F(x)$. He formulated the problem in terms of the data being subject to rounding errors of magnitudes $d_j > 0$. Thus $F(x_j)$ is locally continuous in the neighborhood $V_i$ of the data points defined by $V_i = R_{1i} \times \cdots \times R_{pi}$, where

$$ R_{ji} = [x_{ji} - d_j,\; x_{ji} + d_j], \quad d_j \ge 0, \quad j = 1, \dots, p, \quad i = 1, \dots, n. $$

The mass-preserving constraint is given by

$$ \int_{V_i} dF(x) = \frac{1}{n}, \quad i = 1, \dots, n. \qquad (4.32) $$

The ME($d$) distribution subject to the mass-preserving constraint (4.32) and the mean-preserving constraint (4.28) is the product of uniform marginals if $x \in V_i$. For $d_j = 0$, $F^*(x_j)$ is the usual empirical distribution.

The covariance matrix $S^{**}$ of the ME($d$) distribution has a much simpler structure than that shown in (4.29) and (4.30) for $S^*$. The elements of $S^{**}$ are given by

$$ s^{**}_{jk} = \frac{1}{n} \sum_{i=1}^{n} (x_{ji} - \bar{x}_j)(x_{ki} - \bar{x}_k) + \delta_{jk}\frac{d_j^2}{3}, $$

where $\delta_{jk}$ is the Kronecker delta.

The ME($d$) covariance matrix may be written as $S^{**} = S_x + D_d$, where $S_x$ is the matrix of the usual sample second order central moments $s_{jk}$, and $D_d$ is the diagonal matrix with elements $d_j^2/3$ in the diagonal. Thus, $S^{**}$ is positive definite.

The ME($d$) estimate of the linear regression coefficient is

$$ b_{ME(d)} = S^{**-1}s_{yx} = (S_x + D_d)^{-1}s_{yx}, $$

where $s_{yx}$ is the vector of usual second order central moments between $y$ and the explanatory variables. Thus, $b_{ME(d)}$ also has a ridge structure and may be computed for undersize samples. The ridge values are given by the magnitudes of the rounding errors. When the variables are not subject to measurement error, $b_{ME(d)}$ reduces to the ordinary least squares estimate.
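The ridge structure can be illustrated with an undersize sample. The sketch below assumes that each ridge value is $d_j^2/3$, the variance of a uniform distribution on an interval of half-width $d_j$; that reading, the data, and the rounding magnitudes are assumptions of the example rather than values taken from Vinod (1982).

```python
import numpy as np

def me_d_estimate(X, y, d):
    """Ridge-type ME(d) coefficient estimate (S_x + D_d)^{-1} s_yx.

    Assumes the ridge values are d_j**2 / 3.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    S_x = Xc.T @ Xc / n                 # sample second order central moments
    s_yx = Xc.T @ yc / n                # covariances between y and the regressors
    D_d = np.diag(np.asarray(d) ** 2 / 3.0)
    b = np.linalg.solve(S_x + D_d, s_yx)
    return b, S_x, D_d, s_yx

# Undersize sample: n = 2 observations, p = 3 regressors (hypothetical values).
X = np.array([[1.0, 2.0, 0.5],
              [3.0, 1.0, 2.5]])
y = np.array([1.0, 2.0])
d = np.array([0.05, 0.05, 0.05])        # rounding magnitudes (assumed)

b, S_x, D_d, s_yx = me_d_estimate(X, y, d)
print(np.linalg.matrix_rank(S_x), b)    # S_x alone is rank deficient
```

Because $S_x$ has rank at most $n - 1$ here, it cannot be inverted on its own; adding the positive diagonal makes the system solvable.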
5 Discriminating Between Alternative Models


In this section I consider the problem of discriminating between alternative linear relationships

$$ M_k : y = X_k\beta_k + \varepsilon_k, \quad k = 1, 2, \dots \qquad (5.1) $$

where $y$ and $\varepsilon_k$ are $n \times 1$ vectors, $X_k$ is an $n \times p_k$ full rank matrix, and $\beta_k$ is $p_k \times 1$.

The issue of regression models being nested or non-nested often arises in discussions of discriminating between alternative models. Pesaran (1987) operationalized the concept of nested and non-nested hypotheses in terms of the discrimination information. As an example, Pesaran discussed the issue for regression models using the usual normality assumption and the concept of "true" parameter. I adapt Pesaran's approach and discuss the issue along the lines of the ME and MDI developments of the previous sections.

Let $V_k(\varepsilon_k)$ be the variation function for the error, $\varphi_k$ be the corresponding precision parameter, and $f_k^*(\varepsilon_k)$ be the implied ME model for the error term in $M_k$. If $\theta_k = (\beta_k', \varphi_k)$ and $X_k$ are not subject to variation, then each ME model for the error term in (5.1) implies an ME distribution $f_k^*(y; X_k, \theta_k)$ for $y$ under $M_k$.

A model $M_k$ is said to be nested in the model $M_\ell$, denoted by $M_k \preceq M_\ell$, if and only if

$$ K(\theta_{k0}, \theta_\ell^*; X_k, X_\ell) \equiv \frac{1}{n} \min_{\theta_\ell \in \Theta_\ell} K[f_k^*(y; X_k, \theta_{k0}) : f_\ell^*(y; X_\ell, \theta_\ell)] = 0 \qquad (5.2) $$

for all admissible pseudo-true parameter values $\theta_{k0}$ in the parameter space $\Theta_k$ of $M_k$. If (5.2) holds for some but not all admissible pseudo-true parameter values, then $M_k$ is said to be partially non-nested with respect to $M_\ell$. If (5.2) does not hold for any admissible pseudo-true parameter value, then $M_k$ is said to be globally non-nested with respect to $M_\ell$. If $M_k \preceq M_\ell$ and $M_\ell \preceq M_k$, then the two models are said to be observationally equivalent.

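If we use the quadratic variation function and $\psi_k = \sigma_k^{-2}$, then

$$M_k : f_k^*(y \mid X_k, \beta_k, \sigma_k^2) = N(X_k\beta_k, \sigma_k^2 I_n). \qquad (5.3)$$

Using (2.9) with $\mu_f = X_k\beta_{k0}$ and $\mu_g = X_\ell\beta_\ell$, and minimizing with respect to $(\beta_\ell, \sigma_\ell^2)$ gives

$$\beta_\ell^* = (X_\ell'X_\ell)^{-1}X_\ell'X_k\beta_{k0},$$

$$\sigma_\ell^{*2} = \sigma_{k0}^2 + \frac{1}{n}(X_k\beta_{k0})'[I_n - X_\ell(X_\ell'X_\ell)^{-1}X_\ell']X_k\beta_{k0}. \qquad (5.4)$$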
These MDI parameters give

$$K(\beta_{k0}, \sigma_{k0}^2, \beta_\ell^*, \sigma_\ell^{*2}; X_k, X_\ell) = \frac{1}{2}\log\left(\frac{\sigma_\ell^{*2}}{\sigma_{k0}^2}\right).$$

Therefore, $M_k \preceq M_\ell$ if and only if $\sigma_\ell^{*2} = \sigma_{k0}^2$. That is, the second term in (5.4) is identically
equal to zero, which holds when $\beta_{k0} = 0$ or when $[I_n - X_\ell(X_\ell'X_\ell)^{-1}X_\ell']X_k = 0$. The first case
is trivial. The second condition implies that $M_k \preceq M_\ell$ if $X_k$ is in the linear space spanned
by the columns of $X_\ell$. The usual case of $X_k$ being a submatrix of $X_\ell$ is in accord with this
result. Also in accord with this result is the case in which columns of $X_k$ are constructed by
linear transformations (e.g., principal components) of a subset of the columns of $X_\ell$. When
all columns of $X_\ell$ are also linear functions of all columns of $X_k$, then $M_\ell \preceq M_k$ and the two
models are observationally equivalent. This is the case when, for example, $X_k$ is the matrix
of all the principal components of $X_\ell$.
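
The projection condition above can be checked numerically. The following is a minimal sketch, not part of the original development: the function names, the tolerance, and the simulated design matrices are illustrative assumptions, and the pseudo-true variance is computed with the per-observation (1/n) normalization assumed for (5.2) and (5.4).

```python
import numpy as np

def projection_residual(X_k, X_l):
    """Return [I - X_l (X_l'X_l)^{-1} X_l'] X_k, the part of X_k outside span(X_l)."""
    coef, *_ = np.linalg.lstsq(X_l, X_k, rcond=None)  # regress each column of X_k on X_l
    return X_k - X_l @ coef

def is_nested(X_k, X_l, tol=1e-8):
    """M_k is nested in M_l when every column of X_k lies in the column space of X_l."""
    resid = projection_residual(X_k, X_l)
    return bool(np.linalg.norm(resid) <= tol * max(1.0, np.linalg.norm(X_k)))

def pseudo_true_variance(X_k, X_l, beta_k0, sigma2_k0):
    """sigma_l^{*2}: sigma_k0^2 plus the mean squared projection residual (assumed 1/n scaling)."""
    n = X_k.shape[0]
    r = projection_residual(X_k, X_l) @ beta_k0
    return sigma2_k0 + float(r @ r) / n

rng = np.random.default_rng(0)
X_l = rng.normal(size=(50, 3))
X_sub = X_l[:, :2]                                   # submatrix of X_l
T = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
X_rot = X_l @ T                                      # nonsingular linear transformation of X_l
X_other = rng.normal(size=(50, 2))                   # unrelated regressors
```

Here `is_nested(X_sub, X_l)` illustrates the submatrix case, `X_rot` illustrates observational equivalence (nesting holds in both directions), and `X_other` gives a pseudo-true variance strictly larger than the variance under $M_k$.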

In general, the normal linear regression models (5.3) are either nested or partially
non-nested, but not globally non-nested (see Pesaran 1987 for more detail). This is not necessarily
true when the error variation function for one of the alternative models is not quadratic. For
example, consider discriminating between the following models:

$$M_k : y = X_k\beta_k + \varepsilon_k, \qquad V_k(\varepsilon_k) = |\varepsilon_k|,$$

$$M_\ell : y = X_\ell\beta_\ell + \varepsilon_\ell, \qquad V_\ell(\varepsilon_\ell) = \varepsilon_\ell^2.$$

From Table 1 we find that the ME error distribution for the absolute error variation function
is Laplace. The implied ME distribution for $y$ under $M_k$ is Laplace with mean $X_k\beta_k$ and
variance $\sigma_k^2$. It can be shown (using equation (5.3) of Soofi and Gokhale 1991a) that

$$K(\beta_{k0}, \sigma_{k0}^2, \beta_\ell^*, \omega_\ell^{*2}; X_k, X_\ell) = \frac{1}{2}\log\left(\frac{\pi\,\omega_\ell^{*2}}{e\,\sigma_{k0}^2}\right),$$

where $\omega_\ell^{*2} = \sigma_\ell^{*2}$ is given by (5.4). Therefore, $M_k \preceq M_\ell$ if and only if $\omega_\ell^{*2}/\sigma_{k0}^2 = e/\pi$. But
this is impossible because $\omega_\ell^{*2} \ge \sigma_{k0}^2$ by (5.4) and $e < \pi$. Thus, $M_k$ is globally non-nested
with respect to $M_\ell$. Pesaran (1987) showed a similar result for the case when the distribution
of $y$ under $M_k$ is lognormal.
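
The Laplace versus normal result can be checked numerically for a single observation. This is a sketch under stated assumptions: the closed form used is $\frac{1}{2}\log[\pi\omega^{*2}/(e\sigma_{k0}^2)]$, the integration routine and all numerical values are hypothetical, and the normal model's variance is set to the Laplace variance plus the squared mean discrepancy.

```python
import math

def closed_form_k(omega2, sigma2):
    """Per-observation minimum discrimination information, 0.5*log(pi*omega2/(e*sigma2))."""
    return 0.5 * math.log(math.pi * omega2 / (math.e * sigma2))

def laplace_logpdf(x, mean, var):
    b = math.sqrt(var / 2.0)                  # Laplace scale; variance = 2*b**2
    return -math.log(2.0 * b) - abs(x - mean) / b

def normal_logpdf(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def numeric_k(mean_f, var_f, mean_g, var_g, half_width=40.0, steps=100_000):
    """Midpoint-rule approximation of K[Laplace : normal] for one observation."""
    sd = math.sqrt(var_f)
    lo = mean_f - half_width * sd
    h = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        lf = laplace_logpdf(x, mean_f, var_f)
        total += math.exp(lf) * (lf - normal_logpdf(x, mean_g, var_g)) * h
    return total

sigma2 = 2.0               # Laplace variance under M_k (hypothetical value)
d = 1.0                    # mean discrepancy not explained by the normal model
omega2 = sigma2 + d ** 2   # variance of the closest normal model
```

Because the variance ratio is at least one, `closed_form_k` is bounded below by $\frac{1}{2}\log(\pi/e) > 0$, which is the numerical counterpart of global non-nestedness.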
5.1 MDI Tests


Information  theoretic  testing  of  linear  hypotheses  regarding  a  regression  coefficient  vector 
was  considered  by  Kullback  and  Rosenblatt  (1957).  They  followed  the  common  practice  of 
assuming  normality  of  y  under  various  hypotheses  and  explicated  the  usual  F  statistic  in 
terms  of  the  discrimination  information  function.  But  the  problem  can  be  approached  using 
the  ME  and  MDI  developments. 

Consider the problem of discriminating between two linear relationships $M_1$ and $M_2$ in
(5.1) when $X_k = X$, $k = 1, 2$, $\beta_1$ is known, and $\beta_2$ is unknown. Based on quadratic variation
function with $\sigma_k^2 = \sigma^2$, $k = 1, 2$, we find two ME models for $M_1$ and $M_2$ as shown in (5.3).
In this case, the two models are observationally equivalent.

We evaluate (2.8) for the two ME models and find the discrimination information function

$$K(f_2^* : f_1^*) = \frac{(\beta_2 - \beta_1)'X'X(\beta_2 - \beta_1)}{2\sigma^2}. \qquad (5.5)$$


This  is  the  information  discrepancy  due  to  the  two  different  means  of  the  two  multivariate 
normal  ME  distributions  under  Mi  and  M2. 

An information statistic is found by estimating the unknown parameters in (5.5). Kullback
and Rosenblatt (1957) suggested replacing the unknown parameters $\beta_2$ and $\sigma^2$ in (5.5)
with their best unbiased estimates, the ordinary least square estimates. Using $b_2 = b$ in
(4.9) and the variance estimate

$$s_2^2 = \frac{(y - Xb_2)'(y - Xb_2)}{n - p} \qquad (5.6)$$

gives the information statistic

$$2\hat{K}(f_2^* : f_1^*) = \frac{(b_2 - \beta_1)'X'X(b_2 - \beta_1)}{s_2^2} = pF_{p,n-p}.$$

As indicated in the last expression, the discrimination information statistic follows a multiple
of the F distribution with the usual degrees of freedom.
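
As a small illustration of this statistic, the sketch below computes $(b_2 - \beta_1)'X'X(b_2 - \beta_1)/s_2^2$ and the corresponding F ratio. The function name and the four-observation data set are hypothetical and serve only to make the arithmetic concrete.

```python
import numpy as np

def mdi_statistic_known_beta(X, y, beta_1):
    """Return (statistic, F) with statistic = (b - beta_1)'X'X(b - beta_1)/s^2 = p*F."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares estimate b_2
    resid = y - X @ b
    s2 = float(resid @ resid) / (n - p)         # unbiased variance estimate, as in (5.6)
    diff = b - beta_1
    stat = float(diff @ (X.T @ X) @ diff) / s2
    return stat, stat / p

# Hypothetical data: intercept plus one regressor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 2.0, 5.0])
stat, F = mdi_statistic_known_beta(X, y, beta_1=np.zeros(2))
```

For these data the least square estimate is (1.1, 1.1), the residual sum of squares is 2.7, and the statistic equals 36.3/1.35.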

Kullback  and  Rosenblatt  (1957)  also  developed  a  discrimination  information  statistic  for 
discriminating  between  a  normal  regression  model  and  its  submodel  which  is  a  multiple  of 
the usual F ratio. In terms of the ME and MDI developments of this paper, the problem is
formulated  as  follows. 

Suppose that $M_1 \preceq M_2$ with columns of $X_1$ being a subset of columns of $X_2$. Without
loss of generality, let $\beta_1 = (\beta_1, \cdots, \beta_{p_1}, 0, \cdots, 0)'$, $0 < p_1 < p_2$. Thus $\beta_1$ is partially known
and $\beta_2$ completely unknown.

Based on the quadratic variation with $\sigma_k^2 = \sigma^2$, $k = 1, 2$, the ME model $M_k$ is given by
(5.3). According to the result given in Part (ii) of Example 1.2, the normal ME model for $M_1$
is also the MDI model with reference to the normal ME model for $M_2$, subject to the constraint
$E(Y) = X_1\beta_1$.

The MDI statistic for discriminating between the two models is given by

$$2\hat{K}(f_2^* : f_1^*) = \frac{b_2'X_2'X_2b_2 - b_1'X_1'X_1b_1}{s_2^2} = (p_2 - p_1)F, \qquad (5.7)$$

where $s_2^2$ denotes the unbiased variance estimate (5.6) under $M_2$ as suggested by Kullback
and Rosenblatt (1957). The last equation indicates that the MDI statistic (5.7) is a multiple
of the usual F ratio obtained by the likelihood ratio method or in the analysis of variance
procedure. The MDI derivation of the F ratio is further discussed in the next section.

5.2  MDI  Diagnostics 

Let $f_k^*(y; \theta_k)$ be the ME distribution implied by the variation function $V_k(\varepsilon_k)$ associated
with the linear relationships $M_k$ in (5.1). Here, $\theta_k$ is a $\nu_k \times 1$ vector containing the $p_k$ coefficients
and all the parameters related to $V_k(\varepsilon_k)$. Alternative models are compared according to the
minimum information discrepancy between the unknown data-generating distribution $f(y)$
and $f_k^*(y; \theta_k)$.

Among a set of alternatives, $M_k$, $k = 1, 2, \cdots$, the model $M_{k^*}$ is information optimal if

$$\hat{K}[f(y) : f_{k^*}^*(y; \hat\theta_{k^*})] \le \hat{K}[f(y) : f_k^*(y; \hat\theta_k)], \qquad \text{for all } k = 1, 2, \cdots \qquad (5.8)$$

where $\hat{K}$ is an estimate of the MDI function and $\hat\theta_k$ is an estimate of the model parameter.
The subscript $k$ of $f_k^*$ underscores the fact that different parametric families may be under
consideration.
The partial F ratio is commonly used in search for selecting a submodel $M_k$ of a normal
regression model $M_L$. Examples include selecting the number of lags, stepwise regression,
etc. The use of F as a model selection diagnostic sharply differs from the use of F as a test
of hypothesis which requires a priori specification of a model. The difference between the
casual use of F and the formal statistical inference is not generally recognized in common
practice.

The MDI statistic (5.7) gives the usual partial F ratio the interpretation of an information
criterion for discriminating among the submodels of a normal regression model $M_L$.


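Define FIC(k) as

$$FIC(k) = \frac{2\hat{K}(f_L^* : f_k^*; \hat\sigma_L^2)}{p_L - p_k} = \frac{(n - p_L)(\hat\sigma_k^2 - \hat\sigma_L^2)}{(p_L - p_k)\hat\sigma_L^2}, \qquad p_L > p_k, \quad M_k \preceq M_L. \qquad (5.9)$$

In the last expression, the variance estimates denote the MLE's under the normal models.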

According to (5.9), FIC(k) is an estimate of the information loss per variable omitted
from the largest model. A submodel $M_k$ is favored over another submodel $M_\ell$ whenever

$$FIC(k) < FIC(\ell).$$

The FIC(k) interpretation of the partial F ratio follows from the Kullback and Rosenblatt
(1957) derivation of the F ratio in (5.7). This interpretation allows the use of the
partial F statistic as a diagnostic for subset selection. Apart from the philosophical issues,
in practice, there are generally a number of subsets whose F's are above a threshold; the
one with the minimum F will be selected according to the FIC(k) criterion.
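
A minimal sketch of the FIC(k) rule follows. It assumes the variance estimates are the MLE's $\hat\sigma^2 = RSS/n$, so that FIC(k) can be written directly in residual sums of squares; the submodel labels and RSS values are hypothetical.

```python
def fic(n, p_L, rss_L, p_k, rss_k):
    """FIC(k) written with residual sums of squares (sigma-hat^2 = RSS/n cancels the n)."""
    if not 0 <= p_k < p_L < n:
        raise ValueError("need 0 <= p_k < p_L < n")
    return (n - p_L) * (rss_k - rss_L) / ((p_L - p_k) * rss_L)

# Hypothetical largest model and three candidate submodels: name -> (p_k, RSS_k).
n, p_L, rss_L = 20, 4, 8.0
submodels = {"A": (2, 12.0), "B": (3, 9.0), "C": (1, 20.0)}
scores = {name: fic(n, p_L, rss_L, p_k, rss_k) for name, (p_k, rss_k) in submodels.items()}
best = min(scores, key=scores.get)   # submodel with the smallest information loss per omitted variable
```

With these values the scores are 4.0, 2.0, and 8.0, so submodel B is the one favored by the minimum-F rule.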

In general, estimation of the minimum discrimination information function (5.8) when
the entropy $H[f(y)]$ of the data-generating distribution is unknown is a difficult problem. The
equivalent expression (4.1) has been used for estimating the discrimination information function in a
number of other contexts and may prove to be useful for the present problem in future
research in regression context; see Soofi et al. (1995) and references therein.

In the regression model selection, the estimation of $K[f(y) : f_k^*(y; \theta_k)]$ is often bypassed
and models are compared according to an estimate of the expected log-likelihood function
(4.3). The information criterion (5.8) holds if and only if

$$H_f[f_{k^*}^*(y; \hat\theta_{k^*})] \le H_f[f_k^*(y; \hat\theta_k)] \qquad \text{for all } k = 1, 2, \cdots \qquad (5.10)$$


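Akaike (1973) estimated the expected information discrepancy under the assumption
$f(y) = f_L^*(y; \theta_L)$ with $\nu_L \ge \nu_k$. He developed the following approximate
frequentist risk for $\hat\theta_k$:

$$2E_{\hat\theta_k}\{K[f_L^*(y; \theta_L) : f_k^*(y; \hat\theta_k)]\} \approx \frac{2}{n}\log\frac{f_L^*(y; \hat\theta_L)}{f_k^*(y; \hat\theta_k)} + \frac{2\nu_k}{n} - \frac{\nu_L}{n}. \qquad (5.11)$$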

This  approximation  is  obtained  using  the  second  order  relationship  between  relative  entropy 
and  Fisher  information,  the  asymptotic  normality  of  the  MLE,  and  asymptotic  Chi-square 
property  of  some  quadratic  terms. 

Akaike proposed using the approximate risk (5.11) as an estimate of the information
criterion (5.8) for selecting a submodel of $f_L^*(y; \theta_L)$. In a given problem, $n$, $\nu_L$, and the
likelihood function $f_L^*(y; \hat\theta_L)$ remain constant. Thus models with various $\nu_k$'s are compared
according to the information criterion (AIC)

$$AIC(k) = -2\log f_k^*(y; \hat\theta_k) + 2\nu_k, \qquad k = 1, 2, \cdots$$

The quantity $n^{-1}AIC(k)$ is an almost unbiased estimate of the expected log-likelihood
$2H_f[f_k^*(y; \hat\theta_k)]$ in (5.10). The model that minimizes AIC(k) is approximately minimum
risk and satisfies the MDI criterion (5.8).

For the case of normal regression models (5.3), $\nu_k = p_k + 1$ and AIC(k) is given by

$$AIC(k) = n[\log(2\pi e) + \log\hat\sigma_k^2] + 2(p_k + 1) \qquad (5.12)$$

$$\phantom{AIC(k)} = 2H[f_k^*(y; \hat\beta_k, \hat\sigma_k^2)] + 2(p_k + 1). \qquad (5.13)$$

The constants $n$ and $\log(2\pi e)$ in expression (5.12) are ignorable in applications. The
expression (5.13) shows that AIC(k) discriminates among alternative normal models by combining
the model uncertainty, estimated by the normal regression model entropy, and the number
of the parameters $\nu_k$, giving them equal weights.
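
The equality between the generic form $-2\log f_k^* + 2\nu_k$ and the normal-regression form (5.12) can be verified with a few lines of code. This is a sketch assuming $\hat\sigma_k^2 = RSS_k/n$ is the MLE of the error variance; the function names and the numerical inputs are illustrative.

```python
import math

def neg2_loglik_normal(n, rss):
    """-2 log-likelihood of a normal regression evaluated at the MLE, sigma-hat^2 = RSS/n."""
    sigma2_hat = rss / n
    return n * math.log(2.0 * math.pi * sigma2_hat) + rss / sigma2_hat

def aic_normal(n, p_k, rss):
    """AIC(k) in the form n[log(2*pi*e) + log sigma-hat^2] + 2(p_k + 1)."""
    return n * (math.log(2.0 * math.pi * math.e) + math.log(rss / n)) + 2 * (p_k + 1)

# Hypothetical fit: n = 20 observations, p_k = 3 coefficients, RSS = 9.
aic_value = aic_normal(20, 3, 9.0)
generic_value = neg2_loglik_normal(20, 9.0) + 2 * (3 + 1)
```

The two values agree up to rounding, since the maximized normal log-likelihood contributes $n\log(2\pi\hat\sigma_k^2) + n$.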

Sawa  (1978)  proposed  two  diagnostics  for  discriminating  among  normal  regression  models 
based  on  (5.10).  A  frequentist  diagnostic  is  obtained  by  inserting  estimates  u)^  and  in  the 
expected log-likelihood component of the approximation (4.11), and is given by

BIC{k)  =  -2log  my;^kA)  +  2(Pfc  +  1)  "  2  • 

If  —  up"  is  stochastic  of  order  then  BIC{k)  is  an  asymptotically  unbiased  estimate 

of $2H_f[f_k^*(y; \hat\beta_k, \hat\sigma_k^2)]$ in the risk function (4.11). The issue of inserting an estimate for $\omega^2$ in
the risk function has been criticized by Leamer (1979).

The variance ratio $\hat\sigma_k^2/\omega^2$ decreases in $p_k$ and is interpreted by Sawa as a “discounting
factor” for the penalty of increasing the number of variables. For $\hat\sigma_k^2/\omega^2 = 1$, BIC(k) reduces
to AIC(k). Estimation of $\omega^2$ is the main problem with the implementation of BIC(k).

For  the  normal  regression  models  BlC{k)  is 

(A  2 

For discriminating between two nested normal models $M_k \preceq M_L$ with columns of $X_k$
being a subset of columns of $X_L$, Sawa suggested using $\omega^2 = \hat\sigma_L^2$. Under the assumption
of $f(y) = N(\mu, \omega^2 I_n)$, model selection based on BIC(k) is equivalent to that based on
the magnitude of FIC(k) statistic defined in (5.9). The submodel $M_k$ is favored over $M_L$
whenever $BIC(k) < BIC(L)$, with the condition in terms of the variance ratio given as

^og  -  2(p*  +  2)  +  2  +  2(pi  +  I)  <  0. 

The BIC(k) decision rule is based on the magnitude of FIC(k) due to the fact that

$$\frac{\hat\sigma_k^2}{\hat\sigma_L^2} = 1 + \frac{p_L - p_k}{n - p_L}\,FIC(k).$$

Sawa (1978) also developed a Bayesian diagnostic for discriminating between nested normal
regression models based on the lower bound (4.10). He found that the Bayes estimate
that, under $M_k$, minimizes the lower bound in (4.11) is


n  +  pk  .2 

n-pk-2''^- 
The Bayes estimate gives the minimum attainable risk


For the case of two nested normal models $M_k \preceq M_L$ with columns of $X_k$ being a subset of
columns of $X_L$, Sawa showed that the reduced model is favored by the minimum attainable
risk (5.14) if and only if

$$FIC(k) < \frac{2(n - 1)(n - p_L)}{(n + p_k)(n - p_L - 2)}.$$

Young (1987) developed an information criterion for discriminating between normal
regression models by defining risk as $K[f(y; \mu, \omega^2) : f(y; \beta_k, \sigma_k^2)]$. Young
assumed $f(y) = N(\mu, \omega^2 I_n)$, and used the following prior distributions for the parameters:

$$\pi(\mu \mid \omega^2) = N(m, \omega^2 W^{-1}), \qquad \pi(\omega^{-2}) = Gamma(a, u),$$

$$\pi(\beta_k \mid \sigma_k^2) = N(m_k, \sigma_k^2 W_k^{-1}), \qquad \pi(\sigma_k^{-2}) = Gamma(a_k, u_k). \qquad (5.15)$$

When the priors are weak (i.e., $W \to 0$, $a \to 0$, $u \to 0$, $W_k \to 0$, $a_k \to 0$, $u_k \to 0$),
the Bayes estimates of $\beta_k$ and $\sigma_k^2$ are approximately equal to the MLE under the model $M_k$.
If the priors are weak, then the risk is approximately minimized by the model that minimizes

$$CIC(k) = n\log\hat\sigma_k^2 + p_k.$$

Comparing with expression (5.12), we note that CIC(k) gives one half as much weight to
the dimension of the model as that given by AIC(k).


5.3  Other  Discrimination  Information  Diagnostics 

Ibrahim and Laud (1994) used discrimination information function for model selection in
the context of a Bayesian predictive approach. In this approach, the model $M_k$ is evaluated
based on the predictive density for a set of $n$ new observations. Let $y_N[X_k]$ denote the vector
of new observations taken at the design matrix $X_k$. Then using the normal ME distribution
(5.3) as the likelihood function under $M_k$, the predictive density is given by

$$f(y_N[X_k] \mid y) = \int\!\!\int f(y_N \mid \beta_k, \sigma_k^2)\,\pi(\beta_k, \sigma_k^2 \mid y)\,d\beta_k\,d\sigma_k^2.$$
These authors considered the normal-gamma prior (5.15) for the model parameters. The
prior mean for $\beta_k$ was chosen by the pseudo-parameter value $m_k = (X_k'X_k)^{-1}X_k'\mu_0$, where
$\mu_0$ is a “guess” value for $\mu$. The prior precision matrix was chosen as $W_k = X_k'X_k$.

Alternative normal models are compared with the largest model $M_L$ according to the
symmetrized discrimination information function

$$K_{k,L} = K[f(y_N[X_k] \mid y) : f(y_N[X_L] \mid y)] + K[f(y_N[X_L] \mid y) : f(y_N[X_k] \mid y)].$$

Computation  of  this  expression  involves  the  evaluation  of  the  discrimination  function 
between  two  multivariate  t  distributions  which  does  not  have  a  closed  form.  Ibrahim  and 
Laud  found  an  approximate  expression  for  the  symmetrized  information  function  and  showed 
that  for  the  case  of  vague  priors,  it  is  a  monotone  function  of  FIC{k). 

Carota, Parmigiani, and Polson (1996) used discrimination information function in the
context of “model elaboration”. In their approach, a model $M$ is embedded in a larger
family of models $\{M_\zeta\}$, i.e., $M = M_{\zeta_0}$ for a specific value $\zeta_0$. The discrimination information
between the posterior and prior of the elaboration parameter, $K[\pi(\zeta \mid y) : \pi(\zeta)]$, is used as
the diagnostic for the elaboration. When $K[\pi(\zeta \mid y) : \pi(\zeta)]$ is small, the elaborated model is
not supported by the data. These authors developed the following linearized approximation

$$K_L[\pi(\zeta \mid y) : \pi(\zeta)] = \log(B) + S\,[E(\zeta \mid y) - \zeta_0],$$

where $B$ is the Savage density ratio which is equivalent to the Bayes factor under certain
conditions and $S$ is the score function defined as follows:

$$B = \frac{f(y \mid \zeta_0)}{f(y)}, \qquad S = \left.\frac{\partial}{\partial\zeta}\log f(y \mid \zeta)\right|_{\zeta = \zeta_0}.$$

These authors discussed a regression example in which the elaboration is defined by the
inclusion of an additional variable in the model. That is, the elaboration parameter is the
coefficient of the additional variable. The normal likelihood function (5.3) and the
normal-gamma prior (5.15) are used. In the prior, the elaboration parameter has mean zero and
is uncorrelated with the other coefficients. In this problem, $K[\pi(\zeta \mid y) : \pi(\zeta)]$ is also the
discrimination information between two Student-t distributions and does not have a closed
form. Carota, Parmigiani, and Polson (1996) showed that the linearized version provides an
accurate approximation when compared with the case of known error variance.
5.4 Complexity Diagnostics

We have already seen that the MDI diagnostics AIC(k), BIC(k), and CIC(k) discriminate
among the alternative models based on model fit as indicated by the log-likelihood term and
the model complexity as indicated by a term involving $p_k$; see also Poskitt (1987). Some
authors (Rissanen 1986, 1987a, 1987b; Bozdogan 1990; Bozdogan and Haughton 1995) have
proposed model selection criteria with emphasis on model complexity.

Rissanen  defined  stochastic  complexity  for  a  given  class  of  models  as  “the  number  of 
binary  digits  with  which  the  observations  can  be  described”  (Rissanen  1987a).  Let  the 
model class be defined by the pair of density functions:

$$C_{f,\pi} = \{[f(y \mid p_k, \theta_k),\ \pi(\theta_k \mid p_k)], \quad p_k = 0, 1, \cdots\},$$

where $p_k$ is the number of free parameters in the model pair. Then stochastic complexity of
the data points $y_1, \cdots, y_n$ is measured by


SC{k)  =  -logY,  — W  /  fiy\Pk,0k)dn{0k\pk),  Pk  <  n.  (5.16) 

The prior is assumed to be concentrated near the MLE $\hat\theta_k$. The model that minimizes
(5.16) in $C_{f,\pi}$ is preferred.

For sufficiently large $n$, an approximate upper bound for SC(k) is minimized and the
criterion is referred to as the Minimum Description Length (MDL). Ignoring the non-essential
terms, MDL(k) compares the models according to:

$$MDL(k) \approx -\log f(y \mid \hat\theta_k) + \frac{1}{2}\log\left|\frac{\partial^2\log f(y \mid \hat\theta_k)}{\partial\theta_k\,\partial\theta_k'}\right| \approx -\log f(y \mid \hat\theta_k) + \frac{p_k}{2}\log n.$$

For the normal regression model, MDL(k) may be written as

$$MDL(k) \approx \frac{n}{2}[\log(2\pi e) + \log\hat\sigma_k^2] + \frac{p_k}{2}\log n.$$

Thus, the weight $(\log n)/2$ given to the dimension of the model $p_k$ is larger for MDL(k) in
comparison with the MDI diagnostics.
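
The different weights that AIC(k), CIC(k), and MDL(k) attach to the model dimension can be compared side by side. The sketch below assumes the normal-regression forms with $\hat\sigma_k^2 = RSS_k/n$, and takes MDL(k) as $-\log f + (p_k/2)\log n$; the sample size and residual sums of squares are hypothetical.

```python
import math

def neg2_loglik(n, rss):
    """n[log(2*pi*e) + log(RSS/n)], i.e. -2 log-likelihood at the MLE."""
    return n * (math.log(2.0 * math.pi * math.e) + math.log(rss / n))

def aic(n, p, rss):
    return neg2_loglik(n, rss) + 2 * (p + 1)

def cic(n, p, rss):
    return n * math.log(rss / n) + p

def mdl(n, p, rss):
    return 0.5 * neg2_loglik(n, rss) + 0.5 * p * math.log(n)

# Hypothetical comparison: adding one regressor lowers RSS from 105 to 102 with n = 100.
n = 100
small, big = (2, 105.0), (3, 102.0)   # (p_k, RSS_k)
prefers_big_aic = aic(n, *big) < aic(n, *small)
prefers_big_cic = cic(n, *big) < cic(n, *small)
prefers_big_mdl = mdl(n, *big) < mdl(n, *small)
```

On the common scale of $n\log\hat\sigma_k^2$, one extra parameter costs 2 under AIC, 1 under CIC, and $\log n \approx 4.6$ under MDL, so in this example AIC and CIC retain the extra regressor while MDL drops it.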
Bozdogan (1990) defined the complexity of a $p$-variate normal distribution with covariance
$\Sigma$ by the maximal mutual information

$$C[\Sigma(X)] = \max_T \vartheta[TX \wedge (TX)_1, \cdots, (TX)_p] = \frac{p}{2}\log\frac{Tr(\Sigma)}{p} - \frac{1}{2}\log|\Sigma|, \qquad (5.17)$$

where $T$ is the set of all orthonormal transformations in $R^p$. This measure is motivated by
the fact that the mutual information $\vartheta(X \wedge X_1, \cdots, X_p)$ is not invariant under axis rotations
(see Example 2.3), whereas (5.17) is invariant under axis rotations.
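
The rotation invariance of this complexity measure is easy to check numerically. The sketch below assumes the closed form $\frac{p}{2}\log[Tr(\Sigma)/p] - \frac{1}{2}\log|\Sigma|$; the function name, the covariance matrix, and the rotation angle are illustrative.

```python
import numpy as np

def c1_complexity(sigma):
    """Maximal-information complexity: (p/2) log(tr(S)/p) - (1/2) log|S|."""
    sigma = np.asarray(sigma, dtype=float)
    p = sigma.shape[0]
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        raise ValueError("covariance matrix must be positive definite")
    return float(0.5 * p * np.log(np.trace(sigma) / p) - 0.5 * logdet)

sigma = np.array([[2.0, 0.8], [0.8, 1.0]])            # hypothetical covariance matrix
theta = 0.7                                           # arbitrary rotation angle
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
rotated = rot @ sigma @ rot.T                         # covariance after an axis rotation
```

Both the trace and the determinant are unchanged by an orthonormal transformation, so `c1_complexity(sigma)` and `c1_complexity(rotated)` coincide, and the measure is zero for a covariance proportional to the identity.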

Bozdogan and Haughton (1995) defined Informational Complexity of a model as

$$ICOMP(k) = -2\log f(y; \hat\theta_k) + 2C[\hat\Sigma(\hat\theta_k)],$$

where $\hat\Sigma(\hat\theta_k)$ is the covariance matrix of the estimated parameters. These authors also
discussed attaching a weight $\alpha_n$ to the complexity term.

Two alternative methods were proposed by Bozdogan and Haughton (1995) for estimating the
complexity term. One method uses a sample version of the covariance matrix. A
second method uses the inverse of the estimated Fisher information matrix $[\mathcal{F}(\hat\theta)]^{-1}$. In this
case, the informational complexity criterion is

$$ICOMP_{IFIM}(k) = -2\log f(y; \hat\theta_k) + 2C[\{\mathcal{F}(\hat\theta_k)\}^{-1}].$$

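For the normal regression model, ICOMP(k) is computed as

$$ICOMP(k) = -2\log[f_k^*(y; \hat\beta_k, \hat\sigma_k^2)] + 2C[\hat\Sigma(\hat\beta_k)]$$

$$\phantom{ICOMP(k)} = n[\log(2\pi e) + \log\hat\sigma_k^2] + p_k\log\left(\frac{Tr(X_k'X_k)^{-1}}{p_k}\right) - \log|(X_k'X_k)^{-1}|.$$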

The  second  version  of  the  complexity  criterion  is  computed  as 


ICOMPIFIM{k)  = 
6 Collinearity Analysis


The existence of a near linear relationship among the columns of the regression matrix $X$
is referred to as (near) collinearity. When $X$ is collinear, the inversion of $X'X$ is problematic
and creates computational and conceptual problems in regression analysis. The computational
issue is that the solution to the least square equations, $b = (X'X)^{-1}X'y$, changes
drastically with a slight perturbation of $X$. The conceptual aspects of the collinearity are the
problems associated with the inference based on a distribution that depends on a collinear
regression matrix.

Often the subject of inference is the regression coefficient vector, $\beta$. The traditional
literature has cast the collinearity problem as the lack of adequate “information” in the
data for estimating $\beta$, but has not gone beyond the semantic notion of information. Formally,
the effects of collinearity on information about $\beta$ can be measured in terms of the entropy,
relative entropy, and mutual information functions discussed in Section 2 (Soofi 1988, 1990).

Consider the normal ME regression model $f^*(y \mid \beta, \sigma^2) = N(X\beta, \sigma^2 I_n)$. In the Bayesian
framework, $\beta$ is subject to variation and the posterior distribution of $\beta$, given the data,
is the vehicle of inference about the regression coefficients. In the frequentist approach,
the inference about the regression coefficient is based on the distribution induced by the
sampling variation of the data on an estimate of $\beta$. However, as will be seen, there are some
information dualities between the two approaches.

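The following reparametrization of (3.1) is useful for collinearity analysis.

$$y = (X\Gamma)(\Gamma'\beta) + \varepsilon = W\alpha + \varepsilon, \qquad (6.1)$$

where $\Gamma = [\gamma_1, \cdots, \gamma_p]$ is the orthogonal matrix of the eigenvectors of $X'X$ and $W =
[W_1, \cdots, W_p]$ is the transformed regression matrix in the directions of the principal
components of $X$. Note that

$$W'W = \Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix}, \qquad (6.2)$$

$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ being the eigenvalues of $X'X$.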
6.1 Collinearity Diagnostics for Least Square Regression

The solution to the least square equation, $b = (X'X)^{-1}X'y$, is justified as an estimate of
$\beta$ according to various estimation criteria. It is the MLE of $\beta$ under normal ME regression
model. In Section 4, $b$ was seen as the MDIM, an approximate MDIB, and the BMOM
estimate of $\beta$.

Consider the case when the only prior information used about $\beta$ is the ranges of the
variations of $\beta_j$'s. As shown in Table 2, the ME prior $\pi^*(\beta)$ is uniform. The prior entropy
is an increasing function of the ranges of the variations. If the ranges are very large, then
the prior is noninformative about predicting the regression coefficients; i.e., $H[\pi^*(\beta)]$ is a
large constant. Given the data and $\sigma^2$, the conditional posterior distribution is
$\pi(\beta \mid y, \sigma^2) = N[b, \sigma^2(X'X)^{-1}]$.

The amount of uncertainty in predicting a value of $\beta$ is measured by the posterior entropy

$$H_X(\beta \mid y, \sigma^2) = \frac{p}{2}\log(2\pi e\sigma^2) - \log|X'X|^{1/2}. \qquad (6.3)$$

Since the prior is noninformative, the effects of collinearity on the information content of the
data about the regression coefficients are examined based on $I_X(\beta \mid \sigma^2) = -H_X(\beta \mid y, \sigma^2)$.

The advantage of the representation (6.1) is that $\pi(\alpha \mid y, \sigma^2) = N(a, \sigma^2\Lambda^{-1})$, where $a = \Gamma'b$
and $\Lambda$ is defined in (6.2). That is, the regression coefficients in the directions of the principal
components $\alpha_j = \gamma_j'\beta$ are uncorrelated normal, hence are independent. Writing the
determinant in (6.3) as the product of the eigenvalues, the posterior entropy is decomposed in
terms of the entropies of the independent components of $\alpha$,

$$I_X(\beta \mid \sigma^2) = \sum_{j=1}^{p}\frac{1}{2}\log\lambda_j - \frac{p}{2}\log(2\pi e\sigma^2) = \sum_{j=1}^{p} I_{W_j}(\alpha_j \mid \sigma^2). \qquad (6.4)$$

The components of information in (6.4) are comparable only when the columns of $X$ are
equilibrated. Henceforth, assume that the columns of $X$ are scaled so that $X'X$ is in correlation
form.
Given the error variance $\sigma^2$, the information quantities $I_{W_j}(\alpha_j \mid \sigma^2)$ are ordered according
to the eigenvalues of $X'X$, with the first element $\alpha_1 = \gamma_1'\beta$ being the most informative
(minimum entropy); i.e., the least difficult to predict. Note that “given $\sigma^2$” implies the presence
of all the components in the regression equation. Thus the components of information in
(6.4) are not criteria for reducing the model. They display the information spectrum of the
regression matrix as a whole.

The explanatory variables are most informative (minimum entropy) about the regression
coefficients when the regression matrix is orthogonal. Thus, a measure of information loss
due to collinearity of $X$ is given by the information difference

$$I_{X^\circ}(\beta \mid \sigma^2) - I_X(\beta \mid \sigma^2) = -\log|X'X|^{1/2}, \qquad (6.5)$$

where $X^\circ$ denotes an orthogonal reference regression matrix.

The information indices of a regression matrix are defined by the information differences:

$$\Delta_j(X) = I_{W_1}(\alpha_1 \mid \sigma^2) - I_{W_j}(\alpha_j \mid \sigma^2) = \log\kappa_j(X), \qquad j = 1, \cdots, p,$$

where $\kappa_j(X)$, $j = 1, \cdots, p$ are the condition indices of the regression matrix (Belsley, Kuh,
and Welsch 1980).

The information number of a regression matrix is defined by the information range

$$\Delta(X) = \log\left(\frac{\lambda_1}{\lambda_p}\right)^{1/2} = \log\kappa(X),$$

where $\kappa(X)$ is the condition number of $X$.

The information spectrum of $X^\circ$ is uniform and $\Delta_j(X^\circ) = 0$ for all $j = 1, \cdots, p$. Therefore,
the origin of measurement for the information indices is the orthogonality. For example,
$\Delta(X)$ directly measures the maximum extent of the collinearity of $X$ in relationship to the
orthogonality. Other consequences of the logarithmic transformations of $|X'X|$ and $\kappa$ are
discussed in Soofi (1990).
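
The information indices and the information number can be computed directly from the eigenvalues of the scaled cross-product matrix. In this sketch the function name is illustrative, and "correlation form" is approximated by scaling each column to unit length without centering, which is an assumption made for brevity.

```python
import numpy as np

def information_indices(X):
    """Return (Delta_j for j = 1..p, Delta) after scaling columns so X'X has unit diagonal."""
    X = np.asarray(X, dtype=float)
    Xs = X / np.linalg.norm(X, axis=0)            # unit-length columns
    lam = np.linalg.eigvalsh(Xs.T @ Xs)[::-1]     # eigenvalues, largest first
    delta_j = 0.5 * np.log(lam[0] / lam)          # logarithms of the condition indices
    return delta_j, float(delta_j[-1])            # last index is the log condition number

# Two unit-length regressors with inner product r = 0.9 (hypothetical example).
r = 0.9
X = np.array([[1.0, r], [0.0, np.sqrt(1.0 - r ** 2)]])
delta_j, delta = information_indices(X)
```

For this matrix the eigenvalues are $1 + r$ and $1 - r$, so the information number is $\frac{1}{2}\log(1.9/0.1)$, while an orthogonal matrix gives indices that are all zero.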

In the least square regression, a computed $b$ is viewed as an outcome of the sampling
distribution $f(b \mid \beta, \sigma^2) = N[\beta, \sigma^2(X'X)^{-1}]$. The entropy of the sampling distribution is also
given by (6.3). Therefore, all the statements made regarding the posterior entropy (6.3)
also have frequentist interpretations. In the sampling theory inference, (6.3) quantifies the
amount of uncertainty in predicting a value of the least square estimate $b$.

The effects of collinearity on the least square estimation may also be measured by the
discrimination information function between the actual sampling distribution $f_X(b \mid \beta, \sigma^2) =
N[\beta, \sigma^2(X'X)^{-1}]$ and the sampling distribution of the estimate as if the regressors were
orthogonal. In collinearity analysis, traditionally it is assumed that the artificial reference
sampling distribution based on the orthogonal regression matrix has mean $\beta$, i.e.,
$f_{X^\circ}(b^\circ \mid \beta, \sigma^2) = N(\beta, \sigma^2 I_p)$. Consequently, the discrimination information between the two
sampling distributions is given by the information discrepancy due to the covariances of two
normal distributions:

$$K(b : b^\circ) \equiv K[f_X(b \mid \beta, \sigma^2) : f_{X^\circ}(b^\circ \mid \beta, \sigma^2)]$$

$$\phantom{K(b : b^\circ)} = \frac{1}{2}\left[Tr(X'X)^{-1} - \log|(X'X)^{-1}| - p\right] = \frac{1}{2}\left[\sum_{j=1}^{p}\lambda_j^{-1} + \sum_{j=1}^{p}\log\lambda_j - p\right].$$

Since $K(b : b^\circ) \ge 0$, with equality if and only if $X$ is orthogonal, and $K(b : b^\circ) \to \infty$
as $X$ descends to perfect collinearity, $K(b : b^\circ)$ measures the loss of information in the least
square estimation due to the nonorthogonality of $X$.

$Tr(X'X)^{-1}$ and $|X'X|$ are collinearity diagnostics with traditional statistical
interpretations. $K(b : b^\circ)$ is composed of the trace, determinant, and the rank of $(X'X)^{-1}$. The
information loss is measured by a comprehensive summary of the covariance matrix of the
sampling distribution and is inclusive of the traditional measures.
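
The behavior of $K(b : b^\circ)$ as the regression matrix approaches collinearity can be illustrated from the eigenvalues alone. This is a minimal sketch assuming $X'X$ is in correlation form, so the eigenvalues sum to $p$; the function name and the eigenvalue sets are hypothetical.

```python
import math

def info_loss(eigenvalues):
    """K(b : b0) = 0.5 * [sum(1/lambda_j) + sum(log lambda_j) - p]."""
    lam = list(eigenvalues)
    if min(lam) <= 0.0:
        raise ValueError("eigenvalues must be positive (X of full column rank)")
    p = len(lam)
    return 0.5 * (sum(1.0 / l for l in lam) + sum(math.log(l) for l in lam) - p)

# Eigenvalues of a 2 x 2 correlation matrix are 1 + r and 1 - r.
orthogonal = info_loss([1.0, 1.0])      # r = 0
moderate = info_loss([1.5, 0.5])        # r = 0.5
severe = info_loss([1.9, 0.1])          # r = 0.9
extreme = info_loss([1.99, 0.01])       # r = 0.99
```

The loss is zero in the orthogonal case and increases without bound as the smallest eigenvalue approaches zero, with the trace term $\sum_j \lambda_j^{-1}$ dominating.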

The discrimination information function between the actual posterior distribution of the
regression coefficient vector, $\pi_X(\beta \mid y, \sigma^2) = N[b, \sigma^2(X'X)^{-1}]$, and the posterior distribution
as if the regressors were orthogonal, $\pi_{X^\circ}(\beta \mid y, \sigma^2) = N(b^\circ, \sigma^2 I_p)$, is

$$K(\pi_X : \pi_{X^\circ}) = K(b : b^\circ) + \frac{(b - b^\circ)'(b - b^\circ)}{2\sigma^2}.$$

The second term is the information discrepancy due to two different posterior means. This
term is a measure of the effect of collinearity on the solutions to the least square equations
which is ignored in the traditional collinearity diagnostics.

It is well known that when $X$ is near-collinear, the least square solutions are sensitive to
small perturbations of $X$. The discrimination information function between the perturbed
posterior distribution and the actual posterior distribution is

$$K(\pi_{X^*} : \pi_X \mid y, \sigma^2) = \frac{1}{2}\left[Tr(X'X)(X^{*\prime}X^*)^{-1} - \log|X'X(X^{*\prime}X^*)^{-1}| - p\right] + \frac{(b - b^*)'X'X(b - b^*)}{2\sigma^2},$$

where $X^* = X + dX$ is the perturbed regression matrix and $b^*$ is the perturbed least
square solution. The first term measures the effect of the perturbation on the covariance
structure. This term may also be interpreted as the effect of the perturbation on the sampling
distribution of the least square estimate. The second term is the effect of perturbation on
the solutions to the least square equations.
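
The two terms of the perturbation diagnostic can be computed separately. The sketch below assumes the normal-posterior form with a common known $\sigma^2$ and a mean term divided by $2\sigma^2$; the function name, the near-collinear data, and the perturbation `dX` are all hypothetical.

```python
import numpy as np

def perturbation_information(X, X_star, y, sigma2):
    """Return (covariance term, mean term) of K(pi_{X*} : pi_X | y, sigma^2)."""
    A, A_star = X.T @ X, X_star.T @ X_star
    b = np.linalg.solve(A, X.T @ y)                   # least square solution for X
    b_star = np.linalg.solve(A_star, X_star.T @ y)    # least square solution for X*
    M = A @ np.linalg.inv(A_star)
    p = X.shape[1]
    cov_term = 0.5 * (np.trace(M) - np.linalg.slogdet(M)[1] - p)
    diff = b - b_star
    mean_term = float(diff @ A @ diff) / (2.0 * sigma2)
    return float(cov_term), mean_term

# Two nearly identical regressors and a small perturbation of the design.
X = np.column_stack([[1.0, 2.0, 3.0, 4.0, 5.0],
                     [1.01, 1.98, 3.02, 3.99, 5.01]])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
dX = 0.01 * np.array([[1, -1], [-1, 1], [1, 1], [-1, -1], [1, -1]], dtype=float)
cov_term, mean_term = perturbation_information(X, X + dX, y, sigma2=1.0)
```

Reporting the two terms separately shows whether a perturbation acts mainly through the covariance structure or through the shift in the least square solution.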

6.2  Collinearity  Diagnostics  With  Prior  Information 

Consider the case when the prior information assumed about the regression coefficients is in
the form of the quadratic variation functions $V(\beta_j) = (\beta_j - m_j)^2$, $j = 1, \cdots, p$. Table 2 gives
the ME prior $\pi^*(\beta \mid m, \tau^2) = N(m, \tau^2 I_p)$. The prior independence among the regression
coefficients is due to the fact that no information about the interrelationships between $\beta_j$'s was
used in the ME computation.

The prior uncertainty about the regression coefficients is given by

$$H(\beta|\tau^2) = \frac{p}{2}\log(2\pi e) + \frac{p}{2}\log\tau^2. \tag{6.6}$$

Based on the ME normal likelihood, the posterior distribution given the error variance $\sigma^2$ is $\pi_X(\beta|y,\sigma^2,m,\tau^2) = N[b(\phi,m), \sigma^2(X'X+\phi I_p)^{-1}]$ where

$$b(\phi,m) = (X'X+\phi I_p)^{-1}(X'y+\phi m), \tag{6.7}$$

and $\phi = \sigma^2/\tau^2$ is the prior to model precision ratio.

The posterior entropy is

$$H_X(\beta|y,\sigma^2,\tau^2) = \frac{p}{2}\log(2\pi e) + \frac{p}{2}\log\sigma^2 - \frac{1}{2}\log|\phi I_p + X'X|. \tag{6.8}$$

The sample information about the regression coefficients is given by the entropy difference

$$\vartheta_X(\beta|\phi) = H(\beta|\tau^2) - H_X(\beta|y,\sigma^2,\tau^2) = \log\left|I_p + \phi^{-1}X'X\right|^{1/2}. \tag{6.9}$$

In fact, $\vartheta_X(\beta|\phi)$ is the mutual information between $y$ and $\beta$, $\vartheta_X(\beta|\phi) = \vartheta_X(\beta \wedge y|\phi)$. Here,
taking  the  expectation  with  respect  to  the  distribution  of  y  is  not  needed  because  the  entropy 
difference  (6.9)  is  functionally  independent  of  y.  (More  generally,  every  sample  drawn  from 
a  normal  distribution  is  informative  about  the  mean  which  is  also  drawn  from  a  normal 
distribution.) 

Although the prior entropy (6.6) depends on the prior variance and the posterior entropy (6.8) depends on the prior and error variances, the sample information (6.9) depends on them only through the precision ratio $\phi$, which is the pivotal quantity in the collinearity analysis in the presence of prior information. The sample information is decomposable as:

$$\vartheta_X(\beta|\phi) = \frac{1}{2}\sum_{j=1}^{p}\log(1+\phi^{-1}\lambda_j) = \sum_{j=1}^{p}\vartheta_{W_j}(\alpha_j|\phi) = \vartheta_W(\alpha|\phi). \tag{6.10}$$

Thus given $\phi$, the components of the sample information $\vartheta_{W_j}(\alpha_j|\phi)$ are ordered according to the eigenvalues.

The mutual information is maximum when the regression matrix is orthogonal. In the presence of prior information, a measure of information loss due to collinearity of $X$ is given by the information difference

$$\vartheta L_X(\beta|\phi) = \vartheta_{\max}(\beta|\phi) - \vartheta_X(\beta|\phi) = \log\left|(1+\phi)(X'X+\phi I_p)^{-1}\right|^{1/2}.$$


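The sample information loss $\vartheta L_X(\beta|\phi)$ has the following properties.

(i) $\vartheta L_X(\beta|\phi)$ is monotonically decreasing in $\phi$. Given $\sigma^2$, $\vartheta L_X(\beta|\phi)$ is monotonically decreasing in the prior precision $1/\tau^2$.

(ii) Given $\sigma^2$, $\vartheta L_X(\beta|\phi) < IL_X(\beta|\sigma^2)$ for all $\tau^2 > 0$; and $\vartheta L_X(\beta|\phi) \to IL_X(\beta|\sigma^2)$ as $\tau^2 \to \infty$.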
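
The decomposition (6.10) and the two properties of the loss can be checked numerically. The sketch below assumes the regressors are in correlation form (centered columns of unit length), so that the orthogonal reference design has $X'X = I_p$; the helper names and the simulated data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def correlation_form(X):
    """Center each column and scale it to unit length, so X'X is a correlation matrix."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc, axis=0)

def sample_information(X, phi):
    """(1/2) log|I_p + X'X / phi|, the sample information (6.9)."""
    p = X.shape[1]
    _, logdet = np.linalg.slogdet(np.eye(p) + (X.T @ X) / phi)
    return 0.5 * logdet

def sample_information_components(X, phi):
    """Components (1/2) log(1 + lambda_j / phi), largest eigenvalue first."""
    lam = np.linalg.eigvalsh(X.T @ X)[::-1]
    return 0.5 * np.log1p(lam / phi)

def sample_information_loss(X, phi):
    """Loss relative to an orthogonal design with X'X = I_p (assumed reference)."""
    p = X.shape[1]
    return 0.5 * p * np.log1p(1.0 / phi) - sample_information(X, phi)

# Hypothetical design: two columns share a common factor z, the third is unrelated.
rng = np.random.default_rng(1)
z = rng.normal(size=(40, 1))
X = correlation_form(np.hstack([z + 0.1 * rng.normal(size=(40, 2)),
                                rng.normal(size=(40, 1))]))
losses = [sample_information_loss(X, phi) for phi in (0.01, 0.1, 1.0, 10.0)]
```

In this example the entries of `losses` shrink as the precision ratio grows, and each stays below $-\frac{1}{2}\log|X'X|$, the limiting value as $\phi \to 0$.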

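Given the precision ratio $\phi$, the information indices of $X$ are obtained by the information differences

$$\begin{aligned}
\Delta_j(X,\phi) &= \vartheta_{W_1}(\alpha_1|\phi) - \vartheta_{W_j}(\alpha_j|\phi) \\
&= H_X(\gamma_j'\beta|y,\sigma^2,\tau^2) - H_X(\gamma_1'\beta|y,\sigma^2,\tau^2) \qquad (6.11) \\
&= \log\left[\frac{\phi+\lambda_1}{\phi+\lambda_j}\right]^{1/2} = \log\kappa_j(\phi I_p + X'X)^{1/2}, \quad j = 1,\dots,p.
\end{aligned}$$

$\Delta_j(X,\phi)$ generalizes $\Delta_j(X)$, which is given by $\phi = 0$. Further generalization may be obtained by using $\phi_j$, $j = 1,\dots,p$, in (6.11).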

The information indices (6.11) are also interpretable in the sampling theory framework. By letting $m = 0$ in (6.7) we obtain the ridge estimate of the regression coefficients with the ridge parameter $\phi$. The entropy of the sampling distribution of the ridge estimate is also given by (6.8). By the second equality in (6.11) the information indices display the information spectrum of the sampling distribution of the ridge estimate $b(\phi)$.
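
A minimal sketch of the information indices and of the ridge estimate obtained from (6.7) with $m = 0$ follows. The function names and the simulated data are assumptions for illustration; the index formula is the closed form $\frac{1}{2}\log[(\phi+\lambda_1)/(\phi+\lambda_j)]$ read from (6.11).

```python
import numpy as np

def information_indices(X, phi=0.0):
    """Delta_j(X, phi) = (1/2) log[(phi + lambda_1) / (phi + lambda_j)], j = 1..p."""
    lam = np.linalg.eigvalsh(X.T @ X)[::-1]      # lambda_1 >= ... >= lambda_p
    return 0.5 * np.log((phi + lam[0]) / (phi + lam))

def ridge_estimate(X, y, phi):
    """b(phi) = (X'X + phi I_p)^{-1} X'y, i.e. (6.7) with m = 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + phi * np.eye(p), X.T @ y)

# Hypothetical data with two nearly collinear columns.
rng = np.random.default_rng(2)
x1 = rng.normal(size=60)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=60), rng.normal(size=60)])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=60)

idx_ls = information_indices(X)             # phi = 0: indices of X itself
idx_ridge = information_indices(X, phi=1.0) # indices after adding prior precision
```

Comparing `idx_ls` with `idx_ridge` shows how a positive $\phi$ compresses the spectrum: every index shrinks toward zero as $\phi$ grows.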

Another measure of information loss due to collinearity in the presence of prior information is given by the discrimination information function between the actual posterior distribution $\pi_X(\beta|y,\sigma^2,m,\tau^2)$ and the posterior distribution as if the regressors were orthogonal, $\pi_{X^o}(\beta|y,\sigma^2,m,\tau^2)$. This discrimination information loss due to collinearity is given by

$$\begin{aligned}
K(\pi_X : \pi_{X^o}|y,\sigma^2,m,\tau^2) ={}& \frac{1}{2}\left\{\mathrm{Tr}\left[(1+\phi)(X'X+\phi I_p)^{-1}\right] - \log\left|(1+\phi)(X'X+\phi I_p)^{-1}\right| - p\right\} \\
&+ \frac{1+\phi}{2\sigma^2}\left[b(\phi,m) - b^o(\phi,m)\right]'\left[b(\phi,m) - b^o(\phi,m)\right].
\end{aligned}$$

The discrimination information loss has the following properties.

(i) For all $\phi > 0$, $K(\pi_X : \pi_{X^o}|y,\sigma^2,m,\tau^2)$ is finite for all $X$.

(ii) For a given $X$, $K(\pi_X : \pi_{X^o}|y,\sigma^2,m,\tau^2)$ is monotonically decreasing in $\phi$.

(iii) For all $\tau^2 > 0$, $K(\pi_X : \pi_{X^o}|y,\sigma^2,m,\tau^2) < K(\pi_X : \pi_{X^o}|y,\sigma^2)$.

We have seen that the information loss due to collinearity, measured by $\vartheta L_X(\beta|\phi)$ or $K(\pi_X : \pi_{X^o}|y,\sigma^2,m,\tau^2)$, is always less than the loss when no prior information regarding the variation of the regression coefficients is used. Therefore, one can compensate for the sample loss of information due to collinearity by acquiring nonsample information in order to decrease the maximum average variation of the regression coefficients, $\tau^2$; i.e., to increase the prior precision. A collinearity information graph may be constructed by plotting various information functions against $\phi$ or $\tau^2$; see Soofi (1990). The graph is useful for determining the prior precision needed for a certain amount of collinearity loss reduction.

Note that the form of prior information used is important for the collinearity analysis. The normal ME prior $\pi^*(\beta|m,\tau^2) = N(m,\tau^2 I_p)$ is obtained based on the information regarding the variations of the individual regression coefficients. The diagnostics discussed above are based on the prior ignorance about the interrelationships among the regression coefficients. If we wish to include information regarding the interrelationships among the regression coefficients in the prior, we should use the variation function of the form $V(\beta) = (\beta-m)(\beta-m)'$. Then Table 2 gives the ME prior $\pi^*(\beta|m,\tau^2,\Sigma) = N(m,\tau^2\Sigma)$. In this case, the sample information about the regression coefficients is given by

$$\vartheta_X(\beta|\phi,\Sigma) = \log\left|I_p + \phi^{-1}\Sigma X'X\right|^{1/2}.$$

As an example, consider Zellner's $g$-prior (3.6) which uses $\Sigma = (X'X)^{-1}$. Based on this prior, the sample information about the regression coefficients is given by

$$\vartheta_X(\beta|\phi,\Sigma = (X'X)^{-1}) = \frac{p}{2}\log(1+\phi^{-1}).$$

Because the sample information regarding the correlation structure of the regression coefficients has already been used in the prior, the sample does not add any new information about the correlation. This fact is reflected in the sample information function being free from $X$. The two cases, $\Sigma = I_p$ and $\Sigma = (X'X)^{-1}$, are two extremes in their impacts on the collinearity effects. The prior independence reduces the effects of collinearity on the information about the regression coefficients, whereas $\Sigma = (X'X)^{-1}$ leaves the collinearity problem intact.

6.3  A  Collinearity  Diagnostic  for  Random  Regressors 

Consider the case when the explanatory variables $X = (X_1,\dots,X_p)'$ are jointly normal. Then the information about each $X_j$ provided by the set of other explanatory variables $X_{(-j)}$ is given by the entropy reduction

$$\vartheta(X_j|X_{(-j)}) = H(X_j) - H(X_j|X_{(-j)}) = -\frac{1}{2}\log\left[1-\rho^2(X_j;X_{(-j)})\right], \quad j = 1,\dots,p,$$

where $\rho^2(X_j;X_{(-j)})$ is the square of the multiple correlation between $X_j$ and the other explanatory variables.

As indicated in Example 2.3 (iii), the entropy reduction $\vartheta(X_j|X_{(-j)})$ is also the mutual information function between $X_j$ and the other explanatory variables, $\vartheta(X_j \wedge X_{(-j)})$. As such, $\vartheta(X_j|X_{(-j)})$ measures functional dependency between $X_j$ and the other explanatory variables. For the multivariate normal case the functional dependency is linear. Therefore, $\vartheta(X_j|X_{(-j)})$ is an information measure of collinearity.

The sample version of $\vartheta(X_j|X_{(-j)})$ is related to the traditional variance inflation factor (VIF) as

$$\hat\vartheta(X_j|X_{(-j)}) = \log \mathrm{VIF}_j^{1/2}. \tag{6.12}$$

The VIF is a useful and widely used collinearity diagnostic. Traditionally, $\mathrm{VIF}_j$ is interpreted as the inflation factor of the variance of the sampling distribution of the least square estimate $b_j$, as compared with the case of orthogonal regressors. It is also interpreted as a transformation of the multiple correlation $R_j$. The relation (6.12) gives an information theoretic interpretation to VIF.
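
The relation (6.12) can be verified numerically: half the log of each VIF should match $-\frac{1}{2}\log(1-R_j^2)$, where $R_j^2$ comes from regressing $X_j$ on the remaining regressors. The sketch below uses the standard identity that $\mathrm{VIF}_j$ is the $j$th diagonal element of the inverse sample correlation matrix; the function names and simulated data are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

def collinearity_information(X):
    """Sample version of the entropy reduction: (1/2) log VIF_j."""
    return 0.5 * np.log(vif(X))

def r_squared_on_others(X, j):
    """R^2 from regressing centered column j on the other centered columns."""
    Xc = X - X.mean(axis=0)
    target = Xc[:, j]
    others = np.delete(Xc, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    return 1.0 - (resid @ resid) / (target @ target)

# Hypothetical regressors: the first two columns are strongly related.
rng = np.random.default_rng(3)
x1 = rng.normal(size=80)
X = np.column_stack([x1, x1 + 0.3 * rng.normal(size=80), rng.normal(size=80)])
info = collinearity_information(X)
```

Because the measure is on a log scale, a VIF of 10 corresponds to about 1.15 nats of information about $X_j$ contained in the other regressors.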

6.4  Principal  Component  Regression 

Principal  Component  Regression  (PCR)  refers  to  selecting  a  subset  of  the  transformed 
regressors in the reparametrized model (6.1) and estimating $\beta$ based on the reduced model.
The  purpose  of  PCR  is  to  reduce  the  effects  of  collinearity  on  the  regression  coefficients. 

Let $Q$ be a subset of the index set $\{1,\dots,p\}$ containing $q$ elements. Let $\Gamma_Q$ denote the submatrix containing the $q$ eigenvectors $\gamma_j$, $j \in Q$, of $X'X$. Then the model (6.1) may be reduced as

$$\begin{aligned}
y &= (X\Gamma)(\Gamma'\beta) + \varepsilon \\
&= (X\Gamma_Q)(\Gamma_Q'\beta) + (X\Gamma_{\bar Q})(\Gamma_{\bar Q}'\beta) + \varepsilon \\
&= W_Q\alpha_Q + \varepsilon_Q, \qquad (6.13)
\end{aligned}$$

where $\bar Q$ is the complement of $Q$ in the set of the first $p$ integers; $W_Q$ and $\alpha_Q$ contain $W_j$ and $\alpha_j$, $j \in Q$, respectively. The error term in (6.13) is defined by $\varepsilon_Q = W_{\bar Q}\alpha_{\bar Q} + \varepsilon$. If the specification of the full model is correct, then for $q < p$, $\varepsilon_Q$ does not have the covariance structure $\sigma^2 I_n$. In the sampling theory approach $\alpha_{\bar Q}$ is set to zero, which contradicts the full model specification. In Bayesian analysis the issue is of no concern when the prior expected value is $E(\beta_j) = 0$; see Soofi (1988) for details.

Given an estimate $\hat\alpha_Q$, a PCR estimate of $\beta$ is obtained by $\hat\beta_Q = \Gamma_Q\hat\alpha_Q$. The Bayes PCR estimate of $\beta$ based on the ME normal likelihood and the ME normal prior $\pi^*(\beta|\tau^2) = N(0,\tau^2 I_p)$ is given by

$$\hat\beta_Q(\phi_Q) = \Gamma_Q\left[\Lambda_Q + \phi_Q I_q\right]^{-1}W_Q'y, \tag{6.14}$$

where $\phi_Q$ is the precision ratio for the reduced model and $\Lambda_Q = W_Q'W_Q$, which is the submatrix of (6.2) with diagonal elements $\lambda_j$, $j \in Q$.

The PCR estimate (6.14) is a general class representation of several well-known regression estimates. For $q = p$, (6.14) gives the Bayes estimate $b(\phi,0)$ which is also the ridge regression estimate $b(\phi)$. When $\phi = 0$ and $q = p$, (6.14) gives the ordinary least square estimate $b$, which is the posterior mean under uniform prior. The traditional PCR estimate is found by letting $\phi = 0$.
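
The special cases of (6.14) listed above can be checked with a short sketch. It assumes the eigenvectors of $X'X$ are stored as the columns of $\Gamma$, so that $W = X\Gamma$; the function name `bayes_pcr`, the zero-based indexing of $Q$ (index 0 is the largest eigenvalue), and the simulated data are assumptions for illustration.

```python
import numpy as np

def bayes_pcr(X, y, Q, phi):
    """Gamma_Q (Lambda_Q + phi I_q)^{-1} W_Q' y for a list Q of component indices.

    Components are ordered by decreasing eigenvalue of X'X, indexed from 0.
    """
    lam, Gamma = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]          # re-order to descending
    lam, Gamma = lam[order], Gamma[:, order]
    Q = list(Q)
    G_Q = Gamma[:, Q]                      # selected eigenvectors
    W_Q = X @ G_Q                          # selected principal components
    alpha_hat = (W_Q.T @ y) / (lam[Q] + phi)   # Lambda_Q is diagonal
    return G_Q @ alpha_hat

# Hypothetical data with two nearly collinear columns.
rng = np.random.default_rng(4)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=50), rng.normal(size=50)])
y = X @ np.array([1.0, 1.0, -1.0]) + rng.normal(size=50)

b_ls = bayes_pcr(X, y, Q=[0, 1, 2], phi=0.0)     # q = p, phi = 0: least square
b_ridge = bayes_pcr(X, y, Q=[0, 1, 2], phi=2.0)  # q = p, phi > 0: ridge
b_pcr = bayes_pcr(X, y, Q=[0, 1], phi=0.0)       # drops the weakest component
```

Dropping the component with the smallest eigenvalue (`b_pcr`) removes the direction in which the least square solution is least stable.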

The main issue in PCR is selection of $Q$. The information functions (6.4) and (6.10) can be used to measure the amount of information about $\beta$ retained in a reduced model. In
the  previous  sections,  the  collinearity  diagnostics  were  developed  as  if  the  regression  error 
variance  and  the  prior  variance  were  known.  In  practice  these  quantities  are  estimated 
for  computing  the  information  functions. 

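Let $\hat\sigma_Q^2$ be an estimate obtained using the reduced model (6.13). Then the amounts of information (6.4) retained in the reduced models about $\beta$ are compared according to

$$\hat{\imath}_Q = 2I_{W_Q}(\alpha_Q|\hat\sigma_Q) = 2I_{W_Q}(\Gamma_Q'\beta|\hat\sigma_Q) = \sum_{j\in Q}\log\left(\frac{\lambda_j}{\hat\sigma_Q^2}\right) - q\log(2\pi e). \tag{6.15}$$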

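Similarly, using an estimate $\hat\tau^2$, the amounts of information (6.10) retained in the reduced models about $\beta$ are compared according to

$$\hat\vartheta_Q = 2\vartheta_{W_Q}(\alpha_Q|\hat\phi_Q) = 2\vartheta_{W_Q}(\Gamma_Q'\beta|\hat\phi_Q) = \sum_{j\in Q}\log\left(1+\hat\phi_Q^{-1}\lambda_j\right) = \sum_{j\in Q}\log\left(1+\frac{\hat\tau^2\lambda_j}{\hat\sigma_Q^2}\right). \tag{6.16}$$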

The information criteria (6.15) and (6.16) compare models based on the relative informational value of $W_Q$ within the set of regressors $W_1,\dots,W_p$ as measured by the eigenvalues $\lambda_j$, $j \in Q$, and the precision of the model as estimated by $1/\hat\sigma_Q^2$.

The merits of the information functions for PCR are best seen in the case of simple regression models that include a single transformed variable $W_j$; i.e., $Q = \{j\}$. The components of information (6.4) about the parameters $\alpha_j = \gamma_j'\beta$ are compared according to

$$\hat{\imath}_j = 2I_{W_j}(\alpha_j|\hat\sigma_j) = \log\left(\frac{\lambda_j}{\hat\sigma_j^2}\right) - \log(2\pi e), \quad j = 1,\dots,p, \tag{6.17}$$

where $\hat\sigma_j^2$ is the error variance for the estimated simple regression.

The components of sample information (6.10) are compared according to

$$\hat\vartheta_j = \log\left(1+\frac{\hat\tau^2\lambda_j}{\hat\sigma_j^2}\right).$$

Both information functions (6.17) and (6.18) compare components based on the ratio $\lambda_j/\hat\sigma_j^2$. The eigenvalue signifies the relative informational value of $W_j$ within the set of regressors $W_1,\dots,W_p$, and the error variance indicates the relative strength of the relationship between $W_j$ and the dependent variable. These information functions favor components that are strong on the balance of these two features.

6.5  MDI  Selection  of  Prior  (Ridge)  Parameter 

In  the  presence  of  severe  collinearity,  the  least  square  procedure  often  produces  meaningless 
estimates  for  regression  coefficients.  The  signs  and/or  the  magnitudes  of  the  least  square 
estimates  or  of  some  functions  of  the  estimated  coefficients  may  not  be  meaningful.  If  the 
problem  is  due  to  collinearity,  it  should  be  corrected  by  reducing  the  effects  of  the  collinearity 
in  estimation.  For  example,  when  the  regression  matrix  is  orthogonal,  the  signs  of  the  least 
square  estimates  correspond  to  the  signs  of  the  simple  correlation  coefficients  between  the 
regressors  and  the  dependent  variable.  In  the  case  of  severe  collinearity  the  signs  of  the  least 
square  coefficients  may  differ  from  the  orthogonal  case.  Thus  reduction  of  the  extent  of  the 
collinearity should correct the problem. Since the least square estimates do not satisfy some constraints that the regression coefficients should satisfy, they should be corrected.

Consider the family of estimates constructed by the linear transforms of the least square estimate

$$\mathcal{D} = \{b_D = \Gamma' D\,\Gamma b : D = \mathrm{diag}[d_1,\dots,d_p],\ d_j > 0,\ j = 1,\dots,p\},$$

where $\Gamma$ is the matrix of the eigenvectors of $X'X$.

The elements of the diagonal matrix $D$, $d_j$, $j = 1,\dots,p$, are the altering coefficients. Their role is more directly seen in estimation of the coefficient $\alpha$ of the reparametrized model (6.1). The estimate of $\alpha$ corresponding to $b_D$ is given by the simpler linear transform

$$a_{Dj} = d_j a_j, \quad j = 1,\dots,p, \tag{6.18}$$

where $a = \Gamma b$ is the least square estimate of $\alpha$.

A well-known subfamily in $\mathcal{D}$ is defined by $b_\Phi = \Gamma' a_\Phi$ with

$$a_\Phi = (\Lambda + \Phi)^{-1}\Lambda a, \tag{6.19}$$

where $\Lambda$ is the diagonal matrix defined in (6.2) and $\Phi$ is a diagonal matrix with diagonal elements $\phi_j \ge 0$, $j = 1,\dots,p$. The altering coefficients in (6.19) are

$$d_j = \frac{\lambda_j}{\lambda_j + \phi_j}.$$

For $\phi_j = \phi$, $j = 1,\dots,p$, (6.19) gives the Bayes estimate $b(\phi,0)$ shown in (6.7), which is also referred to as the ordinary ridge estimate. The posterior mean based on the normal likelihood and the normal prior $\pi^*(\alpha) = N(0,T)$, $T = \mathrm{diag}[\tau_1^2,\dots,\tau_p^2]$, is in the form of (6.19) with $\phi_j = \sigma^2/\tau_j^2$. In the ridge regression, (6.19) is referred to as the generalized ridge estimate (Hoerl and Kennard 1970).
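
The role of the altering coefficients in (6.19) can be illustrated as follows. The sketch stores the eigenvectors of $X'X$ as the columns of `Gamma`, so the least square estimate of $\alpha$ is computed as `Gamma.T @ b`; this is only a storage convention assumed for the code (the text writes the same transform as $a = \Gamma b$). The function name and the simulated data are likewise assumptions.

```python
import numpy as np

def generalized_ridge(X, y, phis):
    """Generalized ridge estimate and its altering coefficients d_j.

    phis[j] is paired with the j-th eigenvalue of X'X in the ascending
    order returned by numpy.linalg.eigh.
    """
    lam, Gamma = np.linalg.eigh(X.T @ X)      # columns of Gamma are eigenvectors
    phis = np.asarray(phis, dtype=float)
    d = lam / (lam + phis)                    # altering coefficients
    b = np.linalg.solve(X.T @ X, X.T @ y)     # least square estimate
    a = Gamma.T @ b                           # least square estimate of alpha
    b_phi = Gamma @ (d * a)                   # altered estimate, back-transformed
    return b_phi, d

# Hypothetical data with two nearly collinear columns.
rng = np.random.default_rng(5)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=50), rng.normal(size=50)])
y = X @ np.array([1.0, 1.0, -1.0]) + rng.normal(size=50)

b_ord, d_ord = generalized_ridge(X, y, [0.5, 0.5, 0.5])   # ordinary ridge
b_gen, d_gen = generalized_ridge(X, y, [2.0, 0.1, 0.0])   # component-wise phi_j
```

With equal $\phi_j$ the result coincides with the ordinary ridge estimate; setting $\phi_j = 0$ for a component leaves it unaltered ($d_j = 1$).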

When the least square estimate is transformed in order to circumvent the ill effects of collinearity, it is natural to seek the minimal amount of alteration required. The alteration is considered as a perturbation of a distribution associated with the least square estimate for the inferential purposes. In Bayesian analysis, $b$ is the Bayes estimate under the noninformative prior and the posterior distribution is altered with the use of a prior. Thus the search is for identifying the minimum prior precision $\tau^{-2}$ required for an adequate estimation of $\beta$ (Soofi and Soofi 1989). In the sampling theory approach, the sampling distribution is perturbed. Thus the problem is to select, for example, a ridge procedure that gives adequate parameter estimates with minimal perturbation (Soofi and Gokhale 1991b).

More formally, suppose that the regression coefficient is constrained as $\beta \in B$, where $B$ is a subset of $\Re^p$. Then $b_D^* \in \mathcal{D}$ is chosen such that:

(i) $b_D^* \in B$.

(ii) $K(b_D^* : b) \le K(b_D : b)$ for all $b_D \in \mathcal{D}$, where $K(b_D : b)$ is the discrimination information function between the distributions associated with the two estimates.

For the case of normal ME likelihood and the normal ME prior, the discrimination information function is:

$K(a_D : a|\alpha,\sigma^2)$ is a convex function of the altering coefficients with the global minimum at $D = I_p$. Thus, it may be minimized with respect to the altering coefficients. In practice $\alpha$ and $\sigma^2$ are estimated and the minimization is iterative. For the case of (6.19), the minimization is with respect to $\Phi$. In some applications the formal minimization may be replaced with simpler search methods. For a Bayesian application see Soofi and Soofi (1989) and for a ridge analysis see Soofi and Gokhale (1991b). In the sampling theory approach $K(a_D : a|a,\hat\sigma^2)$ is the MLE estimate of the discrimination information function between the distributions of the ridge and the least square estimates.


7  Influence  of  Observations  on  Information 


In  this  section  I  present  diagnostics  for  measuring  influence  of  an  observation  on  the 
distributions  associated  with  the  regression  coefficients  and  on  the  predictive  distribution. 

Consider  the  case  of  the  noninformative  prior  for  the  coefficients  of  the  normal  ME 
regression  model  (3.4).  The  influence  diagnostics  for  this  case  are  interpretable  in  terms 
of the ordinary least squares regression. The extension to the informative priors may be developed
similarly. 

Let $X_{-i}$ and $y_{-i}$ denote the data with the $i$th observation deleted. Then the change in the amount of uncertainty in predicting a value of $\beta$ due to the presence and absence of the $i$th observation is given by the posterior entropy difference

$$\Delta H_i = H_{X_{-i}}(\beta|y_{-i},\sigma^2) - H_X(\beta|y,\sigma^2) = \log\left(\frac{|X'X|}{|X_{-i}'X_{-i}|}\right)^{1/2} = -\log(1-h_{ii})^{1/2} \ge 0,$$

where $h_{ii}$ is the $i$th diagonal element of $X(X'X)^{-1}X'$; see Poston (1995) for the proof of the last equality. It is well known that $0 < h_{ii} < 1$, thus $\Delta H_i$ is well defined. The last inequality indicates that an observation reduces uncertainty, hence is informative. The influence is negligible when $h_{ii} \ll 1$.
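
The equality between the determinant ratio and the leverage form of $\Delta H_i$ can be confirmed numerically by actually deleting each row. The sketch assumes $h_{ii}$ is the $i$th diagonal element of the hat matrix of the full design; the function names and simulated data are illustrative.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix X (X'X)^{-1} X'."""
    return np.einsum("ij,ij->i", X @ np.linalg.inv(X.T @ X), X)

def entropy_change(X, i):
    """(1/2) log(|X'X| / |X_{-i}'X_{-i}|), computed by deleting row i."""
    X_del = np.delete(X, i, axis=0)
    _, ld_full = np.linalg.slogdet(X.T @ X)
    _, ld_del = np.linalg.slogdet(X_del.T @ X_del)
    return 0.5 * (ld_full - ld_del)

# Hypothetical design matrix with an intercept column.
rng = np.random.default_rng(6)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
h = leverages(X)
delta_h = np.array([entropy_change(X, i) for i in range(X.shape[0])])
```

Since `delta_h` equals `-0.5 * np.log(1 - h)`, the entropy diagnostic ranks observations exactly as their leverages do, but on an information scale.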

$\Delta H_i$ measures the influence of an observation on the extent of the collinearity. Another measure of influence on the collinearity is obtained using the change in the information number of the regression matrix

$$\Delta_i = \Delta(X_{-i}) - \Delta(X) = \log\left\{\frac{\kappa(X_{-i})}{\kappa(X)}\right\},$$

where $\kappa(X)$ is the condition number of $X$.

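The discrimination information function between the two posterior distributions is

$$2K(\pi_{X_{-i}} : \pi_X|y,\sigma^2) = \left[\mathrm{Tr}\,(X'X)(X_{-i}'X_{-i})^{-1} - \log\left|X'X(X_{-i}'X_{-i})^{-1}\right| - p\right] + \frac{(b-b_{-i})'X'X(b-b_{-i})}{\sigma^2}.$$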

The second term is the influence on the posterior mean, which is the least square estimate. It may also be written in terms of the fitted values of $y$. When the error variance is estimated by the regression mean square error, this term equals a multiple of the Cook distance. The discrimination information is more comprehensive than the traditional diagnostics because of the first term, which measures the influence of the $i$th observation on the posterior variance (or variance of the sampling distribution).
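
The connection with the Cook distance can be made concrete. In the sketch below the mean term, with $\sigma^2$ replaced by the regression mean square error, is compared with $p$ times the Cook distance computed from the usual residual and leverage formula, and the covariance term is compared with $h_{ii}/(1-h_{ii}) + \log(1-h_{ii})$, which follows from $X_{-i}'X_{-i} = X'X - x_i x_i'$. Function names and simulated data are assumptions for illustration.

```python
import numpy as np

def influence_terms(X, y, i):
    """Return (covariance term, mean term) of 2K(pi_{X_-i} : pi_X | y, s^2)."""
    n, p = X.shape
    A = X.T @ X
    b = np.linalg.solve(A, X.T @ y)
    resid = y - X @ b
    s2 = resid @ resid / (n - p)              # regression mean square error
    X_del, y_del = np.delete(X, i, axis=0), np.delete(y, i)
    A_del = X_del.T @ X_del
    b_del = np.linalg.solve(A_del, X_del.T @ y_del)
    M = A @ np.linalg.inv(A_del)              # X'X (X_-i'X_-i)^{-1}
    _, logdet = np.linalg.slogdet(M)
    cov_term = np.trace(M) - logdet - p
    diff = b - b_del
    mean_term = diff @ A @ diff / s2
    return cov_term, mean_term

def cook_distance(X, y, i):
    """Cook's distance from the residual/leverage formula."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    resid = y - H @ y
    s2 = resid @ resid / (n - p)
    h = H[i, i]
    return resid[i] ** 2 * h / (p * s2 * (1.0 - h) ** 2)

# Hypothetical regression data with an intercept column.
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(25), rng.normal(size=(25, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=25)
cov0, mean0 = influence_terms(X, y, 0)
```

The covariance term depends on the design only through the leverage, so it flags high-leverage points even when their residuals, and hence their Cook distances, are small.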

Johnson and Geisser (1983) developed diagnostics for assessing the influence of observations on the predictive distribution of $N$ new observations $y_N[X]$ corresponding to the regression matrix $X$. The predictive influence of a subset of observations is measured by the following discrimination information functions between predictive densities: $2K[f(y_N[X]|y_{-s}) : f(y_N[X]|y)]$ or $2K[f(y_N[X]|y) : f(y_N[X]|y_{-s})]$, where $y_{-s}$ denotes the data exclusive of the subset under consideration.

For the normal ME model (3.4) with noninformative prior, the conditional predictive density is

$$f(y_N[X]|y,\sigma^2) = N(Xb, V\sigma^2),$$

where $V = X(X'X)^{-1}X'$. Similarly, the conditional predictive density when the $i$th observation is deleted is found to be $f(y_N[X]|y_{-i},\sigma^2) = N(Xb_{-i}, V_i\sigma^2)$ where $V_i = X(X_{-i}'X_{-i})^{-1}X'$. Then an influence measure of the $i$th observation is given by

$$2K[f(y_N[X]|y_{-i}) : f(y_N[X]|y)] = \left[\mathrm{Tr}(VV_i^{-1}) - \log|VV_i^{-1}| - p\right] + \frac{(b-b_{-i})'X'X(b-b_{-i})}{\sigma^2}.$$

The first term quantifies the influence on the predictive covariance. The last term, when $\sigma^2$ is estimated by the mean square error, is proportional to the Cook distance.

Johnson and Geisser (1983) also developed influence diagnostics for the case of unknown variance. They used Jeffreys prior, which gives the multivariate $t$ distribution (4.19) for the predictive density. Since the discrimination information function between two $t$ densities does not have a closed form, the influence diagnostics are developed based on an approximation. These authors extended their results to the case of normal-gamma prior (5.15). Carlin and Polson (1991) extended this line of work to the case of nonlinear models.